Preface


ICCSAMA 2019, held on December 19-20, 2019, in Hanoi, Vietnam, was the sixth
event in the series of international scientific conferences on computer science,
applied mathematics and applications. The conference was co-organized by the
Computer Science and Applications Department, LGIPM, University of Lorraine,
France; the Institute for Research and Applications of Optimization (VinOptima),
VinTech, Vingroup; the International School, Vietnam National University, Hanoi,
Vietnam; the Laboratory of Mathematics, National Institute for Applied
Sciences-Rouen, France; and the Department of Information Systems, Wroclaw
University of Science and Technology, Poland.

The aim of ICCSAMA 2019 was to bring together leading academic scientists,
researchers and scholars to discuss and share their newest results in the fields of
computer science, applied mathematics and their applications. These two fields are
closely related. It is also clear that the potential of computational methods for
knowledge engineering and optimization algorithms remains to be exploited, which
is both an opportunity and a challenge for researchers.

For the ICCSAMA 2019 edition, the Program Committee received more than 75
submissions. Each paper was peer-reviewed by at least two members of the
International Program Committee and the International Reviewer Board. After the
review process, 37 high-quality papers were selected for oral presentation and
publication in this book. The selected papers cover several topics in applied
mathematics and computer science, and they are divided into four parts: nonconvex
optimization, DC programming and DCA, and applications; data mining and data
processing; machine learning methods and applications; and knowledge information
and engineering systems. Extended versions of selected papers will be considered
for publication in post-conference special issues, including the Journal of
Global Optimization.

ICCSAMA 2019 was attended by about 100 scientists and practitioners. The
conference program was composed of four plenary lectures and one semi-plenary
lecture by world-class speakers, the oral presentation of the 37 selected papers,
and several selected abstracts.




ICCSAMA 2019 created numerous interesting interactions between two
communities, computer science and applied mathematics, and we hope that
researchers and practitioners can find here many inspiring ideas and useful tools and
techniques for their work. Many such challenges are suggested by particular
approaches and models presented in individual chapters of this book.

We would like to thank the chairs and the members of the International Program
Committee as well as the reviewers for their hard work in the review process, which
helped us to guarantee the highest quality of the selected papers for the conference.
We would also like to express our thanks to the keynote speakers for their
interesting and informative talks. Our sincere thanks go to all the authors for their
valuable contributions and to the other participants who contributed to the success
of the conference.

We wish to thank all members of the Organizing Committee for their excellent 
work to make the conference a success. 

We cordially thank Prof. Janusz Kacprzyk and Dr. Thomas Ditzinger from 
Springer for their help in publishing this book. 

Finally, we would like to express our special thanks to the main sponsor 
VinTech City, VinTech, VinGroup for their considerable support. 
for Maximizing the Profit
and the Compactness in Land Use
Planning Problem


Tran Duc Quynh
Abstract. This paper deals with a land use planning problem in which
the objective is to maximize the profit (or to minimize the cost) while
ensuring compactness. The original mathematical model is a multi-objective
optimization problem with binary integer variables. It is then
transformed into a single-objective optimization problem. Commercial
software can solve such a problem, but the computation time is high,
especially for large-scale instances. Hence, finding new efficient
algorithms for the problem is necessary. Recently, two alternative
methods based on a genetic algorithm (GA) and the non-dominated sorting
genetic algorithm (NSGA-II) were proposed. In this work, we propose
a new local method based on the difference of convex functions algorithm
(DCA). The numerical results are compared with those provided
by GA. They show that the proposed algorithm performs much better and
that the obtained solutions are close to the global solutions.


Keywords: DCA · Mixed integer linear optimization · Land use
planning problem · Profit · Compactness


1 Introduction 


The land use planning problem is important because land area is
limited while the population is continuously increasing. Agricultural
land covers about 46% of the earth's land. This share may decrease while
food demand increases [10] because of population growth. It is estimated that
food demand in 2050 will be 70% higher than at present. Therefore,
optimizing the use of agricultural land has attracted the interest
of scientists in mathematics, computer science and agronomy. In the literature,
researchers often formulate the problem as an optimization problem
and then develop solution methods for it. In the last 20 years, many mathematical
models have been proposed. Each model considers a specific case, objective
and set of constraints. The proposed models can be classified into 3 groups [10]:
maximizing the profit [3], optimizing the management of water resources [1], and
optimizing the protection of the environment and ecosystem [2]. Some studies
consider 2 or 3 objectives simultaneously, which leads to multi-objective
optimization problems.

© Springer Nature Switzerland AG 2020

In this research, we tackle a model used in [13] that is based on the one
introduced by Jeroen et al. [4]. The aim is to maximize the total profit while
ensuring that cells with the same land use are as close together as possible
(compactness). The original mathematical model is a bi-objective optimization
problem. It can be transformed into a single-objective optimization problem
by a scalarization technique. The objective of the resulting problem is a
combination of the profit and the compactness. The difficulty of the problem comes
from the binary variables, and exact solution methods often need a long
execution time. Thus, developing efficient local methods is necessary. In [13],
the author proposed two local methods, GA and NSGA-II, to solve the
problem. The experiments showed that NSGA-II is better than GA by 9%,
but its computation time is much longer. In this work, we develop
a deterministic method based on DC programming and DCA to solve the mixed
integer linear optimization model in [13]. The idea is to reformulate the problem
as a DC program by using a penalty technique and then to develop a DC algorithm
(DCA) for solving it.

To evaluate the efficiency of the proposed algorithm, we consider 15 instances
and compare the results provided by DCA and the local method GA. The gap
between the objective value obtained by DCA and the optimal value is also
estimated. The results on simulated data show that the gap of DCA is smaller
than 6% on all instances and smaller than 3% on 12 of the 15. This is quite a
good result for a local method.

The paper is organized as follows. In Sect. 2, we state the problem and present
the mathematical model. Section 3 presents the solution method via DC
programming and DCA. The computational results are reported in Sect. 4, and
Sect. 5 concludes the paper.


2 Problem Statement 


We consider the mathematical model of the land use planning problem
addressed in [13]. It is a variant of the one in [4]. The differences are that
minimizing the cost is replaced by maximizing the profit and that no buffer is
used for the cells on the borders. The problem is stated as follows: consider a
rectangular area that has to be allocated to different land uses. First, we divide
the area into $N \times M$ cells arranged in $N$ rows and $M$ columns; the cell in
row $i$ and column $j$ is called $(i,j)$. Suppose there are $K$ different land uses,
and let $k \in \{1, \dots, K\}$ denote a specific land use. The following parameters
are known:


– $B_{ijk}$: the profit generated by cell $(i,j)$ if it is allocated to land use $k$.
– $T_k$: the total number of cells to be allocated to land use $k$.


The problem is to find the allocation such that the total profit generated by 
the considered area is the largest and the cells with the same allocated land use 
are placed close together to form a block (compactness). 
In [4], the author proposed a mathematical model in the form of a bi-objective
linear optimization problem with binary 0-1 variables. Let $x_{ijk}$ be the decision
variables, equal to 1 if cell $(i,j)$ is used for land use $k$ and 0 otherwise. It is
easy to see that the total profit is expressed as:

$$\text{Profit} = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} B_{ijk}\, x_{ijk}$$

The constraints are as follows.

$$\sum_{k=1}^{K} x_{ijk} = 1 \quad \forall (i,j) \qquad (1)$$

Constraint (1) ensures that each cell is allocated to exactly one land use.


$$\sum_{i=1}^{N} \sum_{j=1}^{M} x_{ijk} = T_k \quad \forall k \qquad (2)$$

Constraint (2) ensures that the number of cells allocated to land use $k$ is $T_k$.
To measure the compactness, variables $y_{ijk}$ are introduced. The value of $y_{ijk}$
equals 0 if cell $(i,j)$ is not allocated to land use $k$ ($x_{ijk} = 0$). If
cell $(i,j)$ is allocated to land use $k$ ($x_{ijk} = 1$), then $y_{ijk}$ is the number
of cells adjacent to cell $(i,j)$ by row or column that are allocated to land use $k$.
Variable $y_{ijk}$ can be expressed as follows.
In the case where cell $(i,j)$ is not on the borders:


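$$y_{ijk} \le 4\, x_{ijk} \quad \forall i,j,k \qquad (3)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} \quad \forall k,\ 2 \le i \le N-1,\ 2 \le j \le M-1 \qquad (4)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} - 4(1 - x_{ijk}) \quad \forall k,\ 2 \le i \le N-1,\ 2 \le j \le M-1 \qquad (5)$$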


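In the case where cell $(i,j)$ is on the borders but is not a corner:

$$y_{ijk} \le x_{i+1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} \quad \forall k,\ i = 1,\ 2 \le j \le M-1 \qquad (6)$$

$$y_{ijk} \ge x_{i+1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} - 3(1 - x_{ijk}) \quad \forall k,\ i = 1,\ 2 \le j \le M-1 \qquad (7)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} \quad \forall k,\ i = N,\ 2 \le j \le M-1 \qquad (8)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i,j-1,k} + x_{i,j+1,k} - 3(1 - x_{ijk}) \quad \forall k,\ i = N,\ 2 \le j \le M-1 \qquad (9)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j+1,k} \quad \forall k,\ 2 \le i \le N-1,\ j = 1 \qquad (10)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j+1,k} - 3(1 - x_{ijk}) \quad \forall k,\ 2 \le i \le N-1,\ j = 1 \qquad (11)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j-1,k} \quad \forall k,\ 2 \le i \le N-1,\ j = M \qquad (12)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i+1,j,k} + x_{i,j-1,k} - 3(1 - x_{ijk}) \quad \forall k,\ 2 \le i \le N-1,\ j = M \qquad (13)$$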
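In the case where cell $(i,j)$ is a corner:

$$y_{ijk} \le x_{i+1,j,k} + x_{i,j+1,k} \quad \forall k,\ i = 1,\ j = 1 \qquad (14)$$

$$y_{ijk} \ge x_{i+1,j,k} + x_{i,j+1,k} - 2(1 - x_{ijk}) \quad \forall k,\ i = 1,\ j = 1 \qquad (15)$$

$$y_{ijk} \le x_{i+1,j,k} + x_{i,j-1,k} \quad \forall k,\ i = 1,\ j = M \qquad (16)$$

$$y_{ijk} \ge x_{i+1,j,k} + x_{i,j-1,k} - 2(1 - x_{ijk}) \quad \forall k,\ i = 1,\ j = M \qquad (17)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i,j+1,k} \quad \forall k,\ i = N,\ j = 1 \qquad (18)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i,j+1,k} - 2(1 - x_{ijk}) \quad \forall k,\ i = N,\ j = 1 \qquad (19)$$

$$y_{ijk} \le x_{i-1,j,k} + x_{i,j-1,k} \quad \forall k,\ i = N,\ j = M \qquad (20)$$

$$y_{ijk} \ge x_{i-1,j,k} + x_{i,j-1,k} - 2(1 - x_{ijk}) \quad \forall k,\ i = N,\ j = M \qquad (21)$$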


The function that measures the compactness is given by

$$f(x,y) = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} y_{ijk}$$

We can see that the compactness measure $f(x,y)$ is based
on the number of pairs of consecutive cells (by row or column) that are
allocated to the same land use. The aim is to maximize the compactness.
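
As an illustration of the two quantities defined so far, the following minimal Python sketch evaluates the total profit and the compactness measure for a given allocation. It assumes the allocation is stored as a grid of land-use labels, so that constraint (1) holds by construction; the function names and the toy data are hypothetical and not taken from the paper. Each adjacent pair of cells with the same land use contributes 2 to the sum, because it is counted once from each of the two cells.

```python
from collections import Counter

def profit_and_compactness(grid, B):
    """grid[i][j] is the land use (0..K-1) assigned to cell (i, j);
    B[i][j][k] is the profit of cell (i, j) under land use k."""
    n_rows, n_cols = len(grid), len(grid[0])
    profit = 0.0
    compactness = 0
    for i in range(n_rows):
        for j in range(n_cols):
            k = grid[i][j]
            profit += B[i][j][k]
            # y_ijk: number of row/column neighbours with the same land use
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_rows and 0 <= nj < n_cols and grid[ni][nj] == k:
                    compactness += 1
    return profit, compactness

def cells_per_use(grid):
    """Number of cells allocated to each land use (to compare with T_k)."""
    return Counter(k for row in grid for k in row)

# Toy 2 x 4 area with K = 2 land uses and unit profits (hypothetical data)
B = [[[1.0, 1.0] for _ in range(4)] for _ in range(2)]
blocked = [[0, 0, 1, 1],
           [0, 0, 1, 1]]
scattered = [[0, 1, 0, 1],
             [1, 0, 1, 0]]
print(profit_and_compactness(blocked, B))    # same profit, high compactness
print(profit_and_compactness(scattered, B))  # same profit, zero compactness
```

Both allocations use four cells per land use and earn the same profit, so only the compactness term separates them, which is the role of the second objective.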

We also need the non-negativity and binary constraints

$$x_{ijk},\ y_{ijk} \ge 0 \quad \forall i,j,k, \qquad (22)$$

$$x_{ijk} \in \{0, 1\} \quad \forall i,j,k. \qquad (23)$$


Hence, we obtain a multi-objective optimization problem

$$\max f_1(x,y) = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} B_{ijk}\, x_{ijk}$$

$$\max f_2(x,y) = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} y_{ijk}$$

$$\text{s.t. } (3)–(23)$$

A technique for solving a multi-objective optimization problem is to transform it
into a single-objective one. Using a coefficient $w > 0$, the single-objective
optimization problem is written as follows:

$$(P')\quad \max f(x,y) = f_1(x,y) + w\, f_2(x,y) \quad \text{s.t. } (3)–(23)$$

Problem $(P')$ is a mixed integer linear program. It can be solved by
commercial software, but the computation time is very long when the
number of integer variables is large. In [13], the author proposed two methods
based on a genetic scheme to solve the bi-objective optimization problem and the
single-objective one. In this work, we propose a local approach based on DC
programming and DCA. The work is motivated by the speed and the efficiency of DCA.
3 DC Programming and Solution Method


3.1 A Brief Presentation of DC Programming and DCA


DC programming and DCA constitute the backbone of nonconvex programming. DCA was
first introduced by Pham Dinh Tao in 1985 and has been extensively developed
since 1994 by Le Thi Hoai An and Pham Dinh Tao in their joint works. It has
been successfully applied to many large-scale (smooth or nonsmooth) nonconvex
programs in various domains of applied science, and has now become classic and
popular. In this section, we briefly present DC programming and DCA (see [5-7]
and references therein for more details).

Let $\Gamma_0(\mathbb{R}^n)$ denote the convex cone of all lower semi-continuous proper
convex functions on $\mathbb{R}^n$. Consider the following primal DC program:

$$(P_{dc})\quad \alpha = \inf\{f(z) := g(z) - h(z) : z \in \mathbb{R}^n\}, \qquad (24)$$

where $g, h \in \Gamma_0(\mathbb{R}^n)$ and the function $f$ is called a DC function
(difference of convex functions).

Let $C$ be a nonempty closed convex set. The indicator function of $C$, denoted
$\chi_C$, is defined by $\chi_C(z) = 0$ if $z \in C$ and $+\infty$ otherwise. Then, the problem

$$\inf\{f(z) := g(z) - h(z) : z \in C\}, \qquad (25)$$

can be transformed into an unconstrained DC program by using the indicator
function of $C$, i.e.,

$$\inf\{f(z) := \varphi(z) - h(z) : z \in \mathbb{R}^n\}, \qquad (26)$$

where $\varphi := g + \chi_C$ is in $\Gamma_0(\mathbb{R}^n)$.
Recall that, for $h \in \Gamma_0(\mathbb{R}^n)$ and $z_0 \in \operatorname{dom} h := \{z \in \mathbb{R}^n \mid h(z) < +\infty\}$, the
subdifferential of $h$ at $z_0$, denoted $\partial h(z_0)$, is defined as

$$\partial h(z_0) := \{\xi \in \mathbb{R}^n : h(z) \ge h(z_0) + \langle z - z_0, \xi \rangle,\ \forall z \in \mathbb{R}^n\}, \qquad (27)$$

which is a closed convex set in $\mathbb{R}^n$. It generalizes the derivative in the sense that
$h$ is differentiable at $z_0$ if and only if $\partial h(z_0)$ reduces to a singleton, which is
exactly $\{\nabla h(z_0)\}$.

The idea of DCA is simple: each iteration of DCA approximates the concave
part $-h$ by its affine majorization (which corresponds to taking $\xi^k \in \partial h(z^k)$) and
minimizes the resulting convex problem $(P_k)$.

Generic DCA scheme
Initialization: Let $z^0 \in \mathbb{R}^n$ be a best guess, $k \leftarrow 0$.

Repeat

Calculate $\xi^k \in \partial h(z^k)$

Calculate $z^{k+1} \in \arg\min\{g(z) - h(z^k) - \langle z - z^k, \xi^k \rangle : z \in \mathbb{R}^n\}$ $(P_k)$

$k \leftarrow k + 1$
Until convergence of $\{z^k\}$.

Convergence properties of the DCA and its theoretical bases are described 
in [5,9,11, 12]. 
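
The generic scheme can be tried on a one-dimensional toy problem. The sketch below is only an illustration under stated assumptions: the DC function $f(z) = z^4 - z^2$ with $g(z) = z^4$ and $h(z) = z^2$ is chosen here for convenience and does not come from the paper, and the helper names are hypothetical. Since $h$ is differentiable, $\xi^k = h'(z^k)$, and the convex subproblem has the closed-form solution $4z^3 = \xi^k$.

```python
import math

def dca(grad_h, argmin_linearized, z0, tol=1e-10, max_iter=200):
    """Generic DCA loop for f = g - h with h differentiable (scalar case)."""
    z = z0
    for _ in range(max_iter):
        xi = grad_h(z)                   # xi^k in the subdifferential of h at z^k
        z_next = argmin_linearized(xi)   # minimise g(z) - xi * z
        if abs(z_next - z) <= tol:
            return z_next
        z = z_next
    return z

# Assumed toy decomposition: g(z) = z**4 and h(z) = z**2, both convex.
def grad_h(z):
    return 2.0 * z

def argmin_linearized(xi):
    # g(z) - xi * z is minimised where 4 z**3 = xi, i.e. z = cbrt(xi / 4)
    return math.copysign(abs(xi / 4.0) ** (1.0 / 3.0), xi)

z_star = dca(grad_h, argmin_linearized, z0=1.0)
print(z_star)  # approaches 1/sqrt(2), a critical point of z**4 - z**2
```

Started from $z^0 = 1$, the iterates decrease towards $1/\sqrt{2}$, a local minimizer of $f$; started from $z^0 = 0$ the scheme stays at 0, which is a critical point but not a minimizer. This illustrates why DCA is described as a local method whose result depends on the starting point.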
3.2 Reformulation and DC Algorithm


To use DCA for solving $(P')$, we transform it into a DC program by using a
penalty technique given in [8]. The work is based on the following theorem.


Theorem 1. [8] Let $\Omega$ be a nonempty bounded polyhedral convex set, $f$ be a
finite DC function on $\Omega$ and $p$ be a finite nonnegative concave function on $\Omega$.
Then there exists $\eta_0 \ge 0$ such that for $\eta > \eta_0$ the following problems have the
same optimal value and the same solution set:

$$(P_\eta)\quad \alpha(\eta) = \min\{f(z) + \eta\, p(z) : z \in \Omega\},$$

$$(P)\quad \alpha = \min\{f(z) : z \in \Omega,\ p(z) \le 0\}.$$


Proof. See [8].

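Denote by $L$ the number of variables of problem $(P')$, $L = 2NMK$, and let
$S = \{z = (x,y) \in \mathbb{R}^L \text{ s.t. } (3)–(23)\}$. The set $D$ is the relaxed domain of $S$, namely
$D = \{z = (x,y) \in \mathbb{R}^L \text{ s.t. } (3)–(22);\ 0 \le x \le 1\}$.

We consider the function $p(z) = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} (1 - x_{ijk})\, x_{ijk}$. It is clear that
$p(z) \ge 0$ for all $z \in D$, and Problem $(P')$ can be written as

$$(P')\quad \min\ -f(z) = -\sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} B_{ijk}\, x_{ijk} - w \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} y_{ijk} \quad \text{s.t. } z \in D,\ p(z) \le 0.$$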

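By using Theorem 1, Problem $(P')$ is transformed into the equivalent problem

$$(P_\eta)\quad \min\{F(z) := -f(z) + \eta\, p(z) : z \in D\},$$

where $\eta$ is a sufficiently large number. It can be seen that $(P_\eta)$ is a DC program.
The DC decomposition $F(z) = G(z) - H(z)$ is described as

$$G(z) = -\sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} B_{ijk}\, x_{ijk} - w \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} y_{ijk},$$

$$H(z) = \eta \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} (x_{ijk}^2 - x_{ijk}).$$

The partial derivatives of $H$ are

$$\frac{\partial H}{\partial x_{ijk}} = 2\eta\, x_{ijk} - \eta \quad \forall i,j,k,$$

$$\frac{\partial H}{\partial y_{ijk}} = 0 \quad \forall i,j,k.$$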
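DCA applied to the land use problem $(P_\eta)$ can be described as follows:

DCA-LU

Initialization

Let $\epsilon$ be a sufficiently small positive number. Set $\ell = 0$ and choose the initial point
$z^0 \in \mathbb{R}^L$.

Repeat

Calculate $\xi^\ell_{ijk} = \frac{\partial H}{\partial x_{ijk}}(z^\ell) = 2\eta\, x^\ell_{ijk} - \eta \quad \forall i,j,k.$

Solve the linear program

$$\min\Big\{ \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} (-B_{ijk} - \xi^\ell_{ijk})\, x_{ijk} - w \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} y_{ijk} : z \in D \Big\}$$

to obtain $z^{\ell+1}$.

$\ell \leftarrow \ell + 1$

Until $\|z^{\ell+1} - z^\ell\| \le \epsilon$ or $|F(z^{\ell+1}) - F(z^\ell)| \le \epsilon$.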
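
To make one DCA-LU iteration more concrete, the sketch below computes the penalty $p$, the partial derivatives of $H$ with respect to $x$, and the cost coefficients of the linear subproblem. It is an illustration only: the variables are stored as flat Python lists, the function names are hypothetical, and solving the linear program over $D$ (done with CVX in the paper) is deliberately left out.

```python
def penalty(x):
    """p(z): sum of (1 - x) * x over all entries of the flat list x."""
    return sum((1.0 - v) * v for v in x)

def grad_H_x(x, eta):
    """Partial derivatives of H(z) = eta * sum(x**2 - x) with respect to x."""
    return [2.0 * eta * v - eta for v in x]

def lp_costs(B, x, eta, w):
    """Cost coefficients of the linear subproblem:
    (-B - xi) on each x entry and -w on each y entry."""
    xi = grad_H_x(x, eta)
    c_x = [-b - g for b, g in zip(B, xi)]
    c_y = [-w] * len(x)
    return c_x, c_y

# Hypothetical flattened data for three (i, j, k) entries
B = [1.5, 1.0, 1.0]
x = [1.0, 0.0, 0.5]
c_x, c_y = lp_costs(B, x, eta=200.0, w=0.5)
print(penalty(x))  # positive because one entry is fractional
print(c_x)         # large negative cost where x = 1, large positive where x = 0
```

With a large penalty coefficient, entries currently at 1 receive a strongly negative cost and entries at 0 a strongly positive one, so the next linear program tends to keep binary entries where they are; only fractional entries are driven mainly by the profit term.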

In the case where the solution provided by DCA does not satisfy the integer
constraints, we change the value of the penalty coefficient $\eta$ and the initial point and
then rerun DCA-LU. We obtain a multi-restart DC algorithm as follows:

ResDCA-LU

Initialization

Let $\eta^0$ be the initial value of the penalty coefficient. Set $\ell = 0$ and choose the initial
point $z^0 = (x^0, y^0) \in \mathbb{R}^L$.

Repeat

Launch DCA-LU with the initial point $z^\ell$ to obtain $z^{\ell+1} = (x^{\ell+1}, y^{\ell+1})$. Set
IntVar $= x^{\ell+1}$.

If $x^{\ell+1}$ is not integer then reset $x^{\ell+1}$ by the rule

$$x^{\ell+1}_{ijk} = \begin{cases} 0 & \text{if MOD } \dots \\ 1 & \text{otherwise} \end{cases}$$

$\eta^{\ell+1} = 10\, \eta^\ell$

$\ell \leftarrow \ell + 1$


Until IntVar is integer.
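
The restart logic can be sketched as a small wrapper around any routine that plays the role of DCA-LU. Everything below is a hedged illustration: `dca_lu` and `reset` are hypothetical callables supplied by the user, only the $x$ part of the point is tracked for simplicity, the stand-in solver does not solve any linear program, and the rounding used as the reset rule in the usage example is a placeholder assumption, not the rule of the paper.

```python
def res_dca(dca_lu, reset, x0, eta0, max_restarts=10, tol=1e-6):
    """Rerun dca_lu with a tenfold larger penalty until x is integer.
    Returns the final x, the final penalty and the number of reruns."""
    x, eta = list(x0), eta0
    for reruns in range(max_restarts + 1):
        x = dca_lu(x, eta)
        if all(abs(v - round(v)) <= tol for v in x):
            return x, eta, reruns
        x = reset(x)   # new starting point for the next run
        eta *= 10.0    # penalty update: eta <- 10 * eta
    raise RuntimeError("no integer solution within the restart budget")

def fake_dca_lu(x, eta):
    # Stand-in solver (assumption): fractional until eta reaches 2000
    return [1.0, 0.0] if eta >= 2000.0 else [0.6, 0.4]

def round_reset(x):
    # Placeholder reset rule (assumption): round to the nearest integer
    return [float(round(v)) for v in x]

x, eta, rn = res_dca(fake_dca_lu, round_reset, x0=[0.0, 0.0], eta0=200.0)
print(x, eta, rn)
```

The returned count corresponds to the number of reruns, so a value of 0 means that the first run of the inner routine already produced an integer point.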


4 Numerical Results 


To evaluate the efficiency of the proposed algorithm, we compare the results
provided by ResDCA-LU and GA. Because real data are lacking, we use
15 simulated instances obtained by varying the size of the area and the profits
generated by each land use. There are 3 sizes: $(N = 10, M = 10)$, $(N = 20, M = 20)$ and
$(N = 50, M = 50)$. For all instances, we suppose that there are 4 land uses ($K = 4$). If
cell $(i,j)$ is suitable for land use $k$, then the corresponding profit is $B_{ijk} = cof > 1$,
and $B_{ijk} = 1$ otherwise. Five cases corresponding to $cof = 1.5; 2; 3; 4; 5$ are
investigated. Assume that the top left, top right, bottom left and bottom right
corners are suitable for the first, second, third and fourth land uses, respectively.
Both algorithms, ResDCA-LU and GA, are implemented in Matlab 2017 and run on an
Intel Core i5 2.8 GHz CPU with 8 GB of RAM. The free software CVX is used to solve
the linear programs. The setting for GA is similar to the one in [13].

We run ResDCA-LU with an initial penalty coefficient of 200. The initial
point for the first run of DCA is $z^0 = 0$, and the parameter $w$ is fixed at 0.5 for all
runs. For each instance, we run GA 10 times and pick the highest-quality solution
to compare with ResDCA-LU. Table 1 presents the results given by ResDCA-LU
and GA. In the table, the following notations are used:

• Size: the size of the area, given by the number of rows and columns.

• $T_k$: the number of cells allocated to land use $k$.

• $cof$: the coefficient reflecting the suitability of cells for land uses. It is
described in the first paragraph of this section.

• $val_{DCA}$: the objective value given by ResDCA-LU.

• RN: the number of reruns of DCA in ResDCA-LU.

• LB: the objective value of the relaxed problem obtained from problem
$(P')$ by removing the integer constraints. It is a lower bound of the optimal
objective value.

• $T_{DCA}$: the execution time in seconds of ResDCA-LU.

• $G_{DCA}$: the gap of DCA, calculated by $G_{DCA} = 100 \left| \frac{val_{DCA} - LB}{LB} \right|$.

• $val_{GA}$: the best objective value given by GA.

• $T_{GA}$: the execution time in seconds of GA.

• $G_{GA}$: the gap of GA, calculated by $G_{GA} = 100 \left| \frac{val_{GA} - LB}{LB} \right|$.


Table 1. Results provided by ResDCA and GA


Size    | T1; T2; T3; T4     | cof | val_DCA  | LB       | RN | T_DCA  | G_DCA | val_GA  | T_GA   | G_GA
10 x 10 | 20; 30; 30; 20     | 1.5 | -298.0   | -316.6   | 1  | 43.7   | 5.9   | -212.0  | 161.8  | 33.0
10 x 10 | 20; 30; 30; 20     | 2   | -345.0   | -360.3   | 0  | 25.2   | 4.2   | -245.0  | 162.8  | 32.0
10 x 10 | 20; 30; 30; 20     | 3   | -435.0   | -450.0   | 3  | 82.9   | 3.3   | -302.0  | 163.3  | 32.9
10 x 10 | 20; 30; 30; 20     | 4   | -525.0   | -540.3   | 3  | 85.3   | 2.8   | -376.0  | 161.9  | 30.4
10 x 10 | 20; 30; 30; 20     | 5   | -615.0   | -630.3   | 3  | 81.2   | 2.4   | -438.0  | 163.3  | 30.5
20 x 20 | 80; 120; 120; 80   | 1.5 | -1291.0  | -1322.8  | 0  | 158.0  | 2.4   | -748.0  | 660.3  | 43.5
20 x 20 | 80; 120; 120; 80   | 2   | -1471.0  | -1502.5  | 0  | 119.5  | 2.1   | -827.0  | 663.4  | 45.0
20 x 20 | 80; 120; 120; 80   | 3   | -1829.0  | -1862.5  | 3  | 473.9  | 1.8   | -993.0  | 657.6  | 46.7
20 x 20 | 80; 120; 120; 80   | 4   | -2193.0  | -2222.5  | 0  | 154.6  | 1.3   | -1166.0 | 667.4  | 47.5
20 x 20 | 80; 120; 120; 80   | 5   | -2553.0  | -2580.5  | 0  | 158.0  | 1.1   | -1344.0 | 667.0  | 47.9
50 x 50 | 500; 750; 750; 500 | 1.5 | -8388.0  | -8489.8  | 1  | 2390.0 | 1.2   | -4314.0 | 9304.7 | 49.2
50 x 50 | 500; 750; 750; 500 | 2   | -9517.0  | -9614.5  | 1  | 2387.9 | 1.0   | -4687.0 | 9277.7 | 51.3
50 x 50 | 500; 750; 750; 500 | 3   | -11767.0 | -11864.5 | 1  | 2245.0 | 0.8   | -5479.0 | 9323.1 | 53.8
50 x 50 | 500; 750; 750; 500 | 4   | -14010.0 | -14114.5 | 3  | 3229.2 | 0.7   | -6244.0 | 9154.1 | 55.8
50 x 50 | 500; 750; 750; 500 | 5   | -16260.0 | -16364.6 | 3  | 2901.0 | 0.6   | -7033.0 | 9259.1 | 57.0
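
The gap columns can be recomputed from the objective values and the lower bound. The short sketch below does this for the first instance of Table 1; the function name `gap_percent` is hypothetical.

```python
def gap_percent(value, lower_bound):
    """Relative gap in percent: 100 * |(value - LB) / LB|."""
    return 100.0 * abs((value - lower_bound) / lower_bound)

# First row of Table 1 (10 x 10, cof = 1.5)
g_dca = round(gap_percent(-298.0, -316.6), 1)
g_ga = round(gap_percent(-212.0, -316.6), 1)
print(g_dca, g_ga)  # 5.9 33.0
```

Because LB is a lower bound of the optimal value of the minimization problem, this gap overestimates the true distance to the optimum, so the reported percentages are conservative.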
From the results, we observe that:


– ResDCA-LU provides an integer solution for all instances, although DCA-LU
is a local algorithm and works on a continuous domain.

– The number of reruns of DCA-LU is at most 3. In some cases,
DCA-LU does not need to be rerun at all.

– The quality of the solutions given by ResDCA-LU is much higher than that of
the solutions given by GA. The DCA solutions are very close to the global optimal
solutions: the gap is smaller than 3% for most instances (12 of 15) and at most
5.9% overall. The obtained solutions can therefore be regarded as near-global.

– ResDCA-LU is much faster than GA. The execution time of GA is about 4
times that of DCA on typical instances (between roughly 1.4 and 6.5 times
across Table 1).


Figure 1 presents the gaps obtained by ResDCA-LU and GA. The gap of
ResDCA-LU decreases as the size of the problem increases, whereas the gap
of GA increases. This indicates that ResDCA-LU is more efficient for larger-scale
problems.


[Figure: line plot of the gap for problems 1 to 15, with two series: gap by GA and gap by DCA.]

Fig. 1. The gaps by DCA and GA.


5 Conclusion 


In this paper, we investigate a mixed integer linear model for the land use planning
problem in which the objective is to maximize a combination of the profit and
the compactness. A local algorithm based on DC programming is proposed by
using reformulation and exact penalty techniques. The new algorithm is compared
with a genetic algorithm (a recent stochastic local algorithm). The experiments
show that the results are promising. On 15 simulated instances,
DCA dominates GA in both objective value and execution time. The solutions
provided by DCA are very close to the global optimal solutions. The main
limitation of this research is the lack of results on real data. In future work, we
plan to investigate DCA more deeply by considering other data scenarios,
to combine DCA with a global scheme to solve the problem globally, or to develop a
variant of the existing model by integrating other criteria.
Abstract. The paper deals with a transportation network protection
problem. The aim is to limit losses due to disasters by choosing an optimal
retrofitting plan. The mathematical model given by Lu, Gupte and Huang
[11] is a mixed integer nonlinear optimization problem. Existing solution
methods are complicated and their computing time is long. Hence,
it is necessary to develop efficient solution methods for the considered
model. Our approach is based on DC (difference of two convex functions)
programming and the DC algorithm (DCA). The original model is first
reformulated as a DC program by using exact penalty techniques. We then
apply DCA to solve the resulting problem. Numerical results on a small
network are reported to illustrate the behavior of DCA. They show that DCA is
fast and that the proposed approach is promising.


Keywords: DC programming · DC algorithm · Penalty function ·
Transportation · Retrofitting · CVaR


1 Introduction 


In a transportation network, bridges are built on roads to cross rivers or
uneven terrain. Due to long-term use or outdated structures, these bridges
are at risk of serious damage or collapse when natural disasters occur.
Once bridges are damaged by extreme events, economic and social losses
follow from the cost of repair and restoration. Moreover, the transportation
network is affected by the repair activities. These losses can be avoided or
reduced if the at-risk bridges are identified and evaluated, so that a proactive
strategy can be proposed. However, due to limited resources, it is not possible
in practice to retrofit all existing bridges. A plan is therefore needed to
prioritize bridge improvements for economic efficiency. Choosing which at-risk
bridge to retrofit should take into account the impact on other at-risk bridges
in the transportation network, because traffic flows are redistributed across
the network. Therefore, it is necessary to consider strategies for retrofitting
bridges at the network level.

The network-based bridge retrofitting problem is a general transportation
network protection problem, and it can be divided into two broad categories,
depending on whether bridges are considered as links or as paths. In essence,
the transportation network protection problem is a network design
problem. Typically, a network design model is a bi-level mathematical optimization
model. The upper-level problem involves the retrofit decisions that are optimal for
social welfare, while the lower-level one concerns the behavior of
network users, which is often represented by a demand-performance equilibrium.

Scenarios of natural phenomena are included in transportation protection
problems. Because we do not know for sure which scenario
will occur, a method that can handle many possible scenarios should be used,
such as stochastic programming (SP) [10] or robust optimization (RO) [1].
Stochastic programming methods take into account the expectation over all
scenarios, so they are suitable for problems whose goal is long-term economic
efficiency. However, they do not work well for extreme events: when extreme
events occur, the network will be affected. Meanwhile, RO methods consider the
worst cases, which have a low probability of occurrence, and often yield costly
solutions. Thus, neither SP nor RO is the best method for handling risk in this
problem.

In [11], Lu, Gupte and Huang developed a mean-risk two-stage stochastic
programming model that handles risk more flexibly when resources are limited.
The first stage minimizes the retrofitting cost by
making strategic retrofit decisions, whereas the second stage minimizes the travel
cost. The conditional value-at-risk (CVaR) is included as the risk measure for
the total system cost. The considered model is equivalent to a nonconvex mixed
integer nonlinear program (MINLP), where the travel cost for bridge links is a
nonlinear and nonconvex function of the retrofit decisions. According to [2],
nonconvex MINLPs can be very difficult to solve. In [11], the model was solved by the
generalized Benders decomposition method [3]. The authors derived a convex
reformulation of the second-stage problem to overcome the algorithmic challenges
arising from the nonconvexity, the nonlinearity, and the non-separability of first-
and second-stage variables. Thus, the model of the transportation protection problem
is formulated as a convex mixed integer nonlinear program (CMINLP).

As noted above, the authors of [11] solved (CMINLP) by generalized Benders
decomposition. We also used commercial software to solve it, but
the execution time is quite long even for a small network. Therefore, developing
efficient solution methods for CMINLP is still a challenge.

In this work, we introduce an alternative solution method based on
nonconvex optimization techniques, namely DC programming
and the DC algorithm, in conjunction with a penalty function technique,
for solving Problem (CMINLP). This approach has been successfully applied to
many nonconvex optimization problems and has proved efficient, in particular
for large-scale problems [5, 8, 9, 12]. We tested it on a nine-node network and found
that the algorithm runs very fast. Moreover, we analyze the factors affecting the
convergence time and the optimal value of the DC algorithm, such as the choice of
penalty function, penalty parameter and starting point.

The paper is organized as follows. After the introduction,
we present the problem description in Sect. 2. Section 3 introduces the
solution method. Experimental results are presented in Sect. 4. The conclusion
is given in the last section.


2 Problem Description 


In this section we restate the model presented in [11]. This model focuses
on protecting a transportation network against extreme disasters such as
earthquakes.


2.1 Parameters and Variables 
To describe the problem, we use the following notations:


$G = (N, A)$: a transportation network with the set of nodes $N$ and the set of
directed arcs (or links) $A$;

$R$: the set of origins in the network;

$S$: the set of destinations in the network;

$OD$: the set of network origin-destination (O-D) pairs;

$d^{rs} \in \mathbb{R}_+$: the given travel demand between O-D pair $(r,s)$, $(r,s) \in OD$;

$\bar{A}$ ($\bar{A} \subset A$, $\bar{A} \neq \emptyset$): the set of arcs that are directly affected by hazards,
primarily including at-risk bridges;

$c_a$: the practical capacity of arc $a$;

$H$: the finite set representing a list of retrofit strategies that can be applied
to at-risk bridges to mitigate the adverse effects caused by future disaster
events;

$b_a^h$: the retrofit cost for $a \in \bar{A}$ with strategy $h$;

$b_0$: the total budget for retrofitting bridges;

$K$: the set of hazard scenarios that can happen to the network;

$p_k \in (0,1)$: the given probability of scenario $k$, $k \in K$;

$\theta_a^{h,k}$: the ratio of post-disaster arc capacity to the full arc capacity, with
$\theta_a^{h,k} \in (0,1]$ for each $k \in K$ and every $a \in \bar{A}$, $h \in H$. When a disaster occurs, the
post-disaster capacity of arc $a \in \bar{A}$ that has been retrofitted with strategy
$h \in H$ equals $c_a \theta_a^{h,k}$;

$\beta$: a parameter obtained from experimental data;

$\gamma$: the parameter that converts travel time into monetary value;

$t_{0a}$: the free-flow travel time of arc $a$.
We use the following variables:


$u_a^h$: the binary variable that takes the value 1 if strategy $h$ is used for arc $a$ and 0
otherwise, for every $a \in \bar{A}$, $h \in H$;

$x_a^{rs,k}$: the flow on arc $a$ corresponding to the pair $(r,s)$ for scenario $k$, for every
$a \in A$, $(r,s) \in OD$ and $k \in K$;

$v_a^k$: the total flow on arc $a \in A$, with $v_a^k = \sum_{(r,s) \in OD} x_a^{rs,k}$ for all $a \in A$;


$q^{rs,k}$: the travel demand that is not satisfied for the O-D pair $(r,s)$.


The model allows for post-disaster travel demand that is not satisfied for a
variety of reasons, such as the closure of certain routes or increased traffic
congestion in the network.


2.2 Mathematical Model 
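Let $U$ be the set defined by

$$U := \Big\{ u \in \{0,1\}^{|\bar{A}||H|} : b^T u \le b_0,\ \sum_{h \in H} u_a^h = 1 \ \ \forall a \in \bar{A} \Big\}. \qquad (1)$$

For the $k$-th scenario, let $f^k(u) = b^T u + Q^k(u)$ be the total cost function,
where $Q^k(u)$ is the optimal value of the total travel cost, given the retrofitting
vector $u$.

The two-stage SP is

$$\text{(2-stage SP):}\quad \min \sum_{k \in K} p_k f^k(u) = \min\ b^T u + \sum_{k \in K} p_k Q^k(u) \quad \text{subject to } u \in U. \qquad (2)$$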


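For the $k$-th scenario, the recourse function is defined as

$$Q^k(u) = \min\ \gamma \sum_{a \in A} t_a^k v_a^k + M \sum_{(r,s) \in OD} q^{rs,k} \qquad (3)$$

$$\phantom{Q^k(u)} = \min\ \gamma \sum_{a \in A} t_{0a} v_a^k \left[ 1 + \beta \left( \frac{v_a^k}{c_a^k(u)} \right)^4 \right] + M \sum_{(r,s) \in OD} q^{rs,k} \qquad (4)$$

$$\text{s.t. } v_a^k = \sum_{(r,s) \in OD} x_a^{rs,k} \ \ \forall a \in A, \quad (x^k, q^k) \in X, \qquad (5)$$

where

$$t_a^k = t_{0a} \left[ 1 + \beta \left( \frac{v_a^k}{c_a^k(u)} \right)^4 \right] \quad \text{(Bureau of Public Roads function [16])}$$

is the arc travel time per unit flow, and

$$c_a^k(u) = \begin{cases} c_a \sum_{h \in H} \theta_a^{h,k} u_a^h & a \in \bar{A}, \\ c_a & a \in A \setminus \bar{A}. \end{cases} \qquad (6)$$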
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The objective function (3) consists of two terms. The first term is total travel 
cost. The second term is included to represent the penalty cost for unsatified 
demand. The set X is defined as: 


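$$X = \Big\{ (x,q) \ge (0,0) \ \Big| \ \sum_{(r,j) \in A} x_{rj}^{rs} - \sum_{(j,r) \in A} x_{jr}^{rs} + q^{rs} = d^{rs} \quad \forall (r,s) \in OD, \tag{7}$$
$$\sum_{(s,j) \in A} x_{sj}^{rs} - \sum_{(j,s) \in A} x_{js}^{rs} - q^{rs} = -d^{rs} \quad \forall (r,s) \in OD, \tag{8}$$
$$\sum_{(i,j) \in A} x_{ij}^{rs} - \sum_{(j,i) \in A} x_{ji}^{rs} = 0 \quad \forall (r,s) \in OD, \ i \in N \setminus \{r,s\} \Big\}. \tag{9}$$


For each pair $(r,s)$, Eqs. (7) and (8), respectively, allow a slack of $q^{rs}$ in the flow
balance at $r$ and $s$ to account for unsatisfied demand, whereas flow conservation
at the other nodes of the network is expressed by Eq. (9).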
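
As a small numerical illustration of the travel time function and of the capacity rule (6), the following Python sketch evaluates both for a single arc. It is only a sketch under stated assumptions: the function names, the capacity 1000 and the flow 400 are hypothetical, the coefficient 0.15 and the exponent 4 are the values assumed here for the travel time function, and the ratios are the scenario-1 values listed for link 5 in Table 1.

```python
def post_disaster_capacity(c_a, theta, u):
    """Capacity of a retrofittable arc: c_a * sum_h theta[h] * u[h] (Eq. 6)."""
    return c_a * sum(th * uh for th, uh in zip(theta, u))


def bpr_travel_time(t0, flow, capacity, delta=0.15, power=4):
    """Travel time per unit flow: t0 * (1 + delta * (flow / capacity) ** power)."""
    return t0 * (1.0 + delta * (flow / capacity) ** power)


theta = [0.15, 0.4, 0.4, 0.6, 1.0]  # ratios for the five strategies (link 5, k = 1)
u = [0, 1, 0, 0, 0]                 # exactly one strategy is selected (the second)
capacity = post_disaster_capacity(1000.0, theta, u)  # 1000.0 is a hypothetical c_a
time_per_unit = bpr_travel_time(1.3, 400.0, capacity)
print(capacity, time_per_unit)
```

With the flow equal to the reduced capacity, the travel time is the free-flow time multiplied by 1 + 0.15, which shows how a weak retrofit strategy raises travel cost through the capacity term.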

Evaluating the recourse function $Q^k(u)$ requires solving the nonlinear optimization
problem (3)-(5) for each scenario $k$. This problem is non-convex because of the terms
$t_a^k v_a^k$ in the objective function and because the equality constraints defining
$t_a^k$ are nonlinear. In [11], for every $u \in U$, the authors derived a reformulation
that yields a convex program with a separation of variables between the first and
second stages.

To reformulate the problem, an auxiliary nonnegative continuous second-stage
variable $y_a^k$ is introduced for each $a \in A$, together with the inequality


$$y_a^k \ge \frac{(v_a^k)^5}{(c_a^k(u))^4}, \quad \forall a \in A. \tag{10}$$


Hence, we have


$$Q^k(u) = \min_{x,q,y} \ \gamma \sum_{a \in A} t_{0a} \left[ v_a^k + \delta y_a^k \right] + M \sum_{(r,s) \in OD} q^{rs,k} \tag{11}$$
$$\text{s.t.} \quad (5), (10). \tag{12}$$
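
The step from (4) to (11) can be made explicit. The short derivation below uses only the definitions above, writing $\delta$ for the coefficient of the travel time function and $y_a^k$ for the auxiliary variable; it is meant as a reading aid rather than as the argument given in [11].

```latex
\begin{align*}
t_{0a}\, v_a^k \left[ 1 + \delta \left( \frac{v_a^k}{c_a^k(u)} \right)^{4} \right]
  &= t_{0a} \left[ v_a^k + \delta\, \frac{(v_a^k)^5}{(c_a^k(u))^4} \right] \\
  &\le t_{0a} \left[ v_a^k + \delta\, y_a^k \right]
  \qquad \text{whenever } y_a^k \ge \frac{(v_a^k)^5}{(c_a^k(u))^4}.
\end{align*}
```

Since the objective is minimized and the coefficient of $y_a^k$ is nonnegative, an optimal $y_a^k$ can be taken equal to this lower bound, so replacing the nonlinear term by $y_a^k$ does not change the optimal value.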


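According to [11], the recourse function $Q^k(u)$ can be formulated as:


$$Q^k(u) = \min_{x,q,v,y,w} \ \gamma \sum_{a \in A} t_{0a} \left[ v_a^k + \delta y_a^k \right] + M \sum_{(r,s) \in OD} q^{rs,k} \tag{13}$$
$$\text{s.t.} \quad v_a^k = \sum_{(r,s) \in OD} x_a^{rs,k} \ \ \forall a \in A, \quad (x^k, q^k) \in X \tag{14}$$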
(vk)? < cf So wh yk = SY yhk Vac A (15) 


heH 
wiht < ct(ghh\* bk Whe Hae Ad (16) 
5 
O<yrF< Casa yh O< wh <cychu® Whe Hae A (17) 


(on) ais 


where $\phi_a$ is a positive constant large enough that $\phi_a c_a$ is an upper bound
on the travel flow of link $a$, for every $a \in A$.

This reformulation allows a linear separation of the first-stage variable $u \in U$
from the second-stage variables.

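According to [11], the mean-risk problem with $\alpha$-level is a convex MINLP:


$$\min_{u,x,q,y,w,\eta,z} \ (1+\lambda) b^T u + \sum_{k \in K} p_k \Big\{ \gamma \sum_{a \in A} t_{0a} \left[ v_a^k + \delta y_a^k \right] + M \sum_{(r,s) \in OD} q^{rs,k} \Big\} + \lambda \Big( \eta + \frac{1}{1-\alpha} \sum_{k \in K} p_k z^k \Big) \tag{CMINLP}$$
subject to
$$u \in U, \quad z^k \ge 0 \quad \forall k \in K, \tag{18}$$
$$z^k \ge \gamma \sum_{a \in A} t_{0a} \left[ v_a^k + \delta y_a^k \right] + M \sum_{(r,s) \in OD} q^{rs,k} - \eta \quad \forall k \in K, \tag{19}$$
$$(14)-(17) \quad \forall k \in K, \tag{20}$$


where $\lambda$ is a predefined weighting factor. The objective of the problem is to
minimize the sum of the bridge retrofitting cost, the expected travel cost, the
unsatisfied demand penalty and the risk term.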


3 Solution Method 


This section introduces an alternative solution method for Problem (CMINLP) based
on DC programming and DCA, a technique from non-convex optimization. This technique
has been successfully applied to many non-convex optimization problems and has
proved efficient, in particular for large-scale problems [5,8,9,12].


3.1 DC Programming and DC Algorithm 


DC programming and DCA constitute the backbone of smooth/nonsmooth non-convex
programming and global optimization. They were introduced by Pham Dinh Tao in 1985
in their preliminary form and have been extensively developed by Le Thi Hoai An
and Pham Dinh Tao since 1994. DCA has been successfully applied to real-world
non-convex programs in different fields of applied sciences (see e.g. [5,13,14]
and the references therein). DCA is one of the rare efficient algorithms for
non-smooth non-convex programming that can solve large-scale DC programs. Although
DCA is a continuous approach, it has been used efficiently to solve nonconvex
linear/quadratic programs with binary variables via exact penalty techniques [4].

For a convex function $f$ defined on $\mathbb{R}^n$ and $x_0 \in \mathrm{dom} f := \{x \in \mathbb{R}^n \mid f(x) < +\infty\}$,
$\partial f(x_0)$ denotes the sub-differential of $f$ at $x_0$, that is


$$\partial f(x_0) = \{ y \in \mathbb{R}^n \mid f(x) \ge f(x_0) + \langle x - x_0, y \rangle, \ \forall x \in \mathbb{R}^n \}.$$


The sub-differential $\partial f(x_0)$ is a closed convex set in $\mathbb{R}^n$. It generalizes the
derivative in the sense that $f$ is differentiable at $x_0$ if and only if $\partial f(x_0)$
reduces to a singleton, which is exactly $\{f'(x_0)\}$.

A general DC program is of the form


$$\inf \{ f(x) := g(x) - h(x) \mid x \in \mathbb{R}^n \}, \tag{$P_{dc}$}$$


with $g, h \in \Gamma_0(\mathbb{R}^n)$, the set of all lower semi-continuous proper convex functions
on $\mathbb{R}^n$. Such a function $f$ is called a DC function, and $g$, $h$ are its DC components.
A generic DCA scheme is as follows:

Initialization: Let $x^0 \in \mathbb{R}^n$ be a good guess, $k = 0$;

Repeat


• Calculate $y^k \in \partial h(x^k)$;
• Calculate $x^{k+1}$ by solving the convex problem


$$\min \{ g(x) - h(x^k) - \langle x - x^k, y^k \rangle \mid x \in \mathbb{R}^n \}; \tag{$P_k$}$$


• $k = k + 1$;

Until convergence of $x^k$.

Each DC function $f$ has infinitely many DC decompositions, and the choice of
decomposition has crucial implications for the qualities of DCA (speed of
convergence, robustness, efficiency, globality of computed solutions, ...).
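
The generic scheme can be illustrated on a one-dimensional toy problem. The Python sketch below is not taken from the paper: it assumes the hypothetical DC function f(x) = x^4 - x^2 with the decomposition g(x) = x^4 and h(x) = x^2, for which both DCA steps have a closed form.

```python
def dca_quartic(x0, tol=1e-10, max_iter=200):
    """DCA for f(x) = x**4 - x**2 with g(x) = x**4 and h(x) = x**2."""
    x = float(x0)
    for k in range(max_iter):
        y = 2.0 * x  # y^k = h'(x^k), the only subgradient of h at x^k
        # x^{k+1} minimizes the convex function g(x) - y * x, i.e. solves 4 x**3 = y
        s = y / 4.0
        x_new = (abs(s) ** (1.0 / 3.0)) * (1.0 if s >= 0 else -1.0)
        if abs(x_new - x) <= tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter


x_star, iterations = dca_quartic(1.0)
print(x_star, iterations)
```

Starting from 1.0 the iterates converge to 1/sqrt(2), a minimizer of f; starting from -1.0 they converge to -1/sqrt(2), and starting from 0.0 they stay at 0, which is a critical point but not a minimizer. This illustrates why the starting point matters for a local method such as DCA.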

We now present the results of the penalty technique presented in [7] relating
to exact penalty techniques in DC programming developed in [6].

Let $K$ be a nonempty bounded polyhedral convex set in $\mathbb{R}^n$ and let $f$ be a DC
function. We consider the general 0-1 problem (GZOP) in the form:


$$\min \{ f(x) \mid x \in K, \ x \in \{0,1\}^n \}. \tag{GZOP}$$
Thanks to the next theorem, we can reformulate this combinatorial optimization
problem as a continuous one.


Theorem 1. [7] Let $K$ be a nonempty bounded polyhedral convex set in $\mathbb{R}^n$, $f$
be a finite DC function on $K$ and $p$ be a finite nonnegative concave function on
$K$. Then there exists $t_0 \ge 0$ such that for all $t > t_0$ the following problems have
the same optimal value and the same solution set:


$$(P_t) \quad \alpha(t) = \min \{ f(x) + t p(x) \mid x \in K \} \tag{21}$$
$$(P) \quad \alpha = \min \{ f(x) \mid x \in K, \ p(x) \le 0 \}. \tag{22}$$
Now, we are able to formulate (GZOP) as a continuous optimization problem.
Let $p$ be the finite function defined on $K$ by


$$p(x) = \sum_{i=1}^{n} \min \{ x_i, 1 - x_i \}.$$


On the set $K' = K \cap [0,1]^n$, $p$ is clearly a nonnegative concave
function. Furthermore, we have


$$\{ x \in K \mid x \in \{0,1\}^n \} = \{ x \in K' \mid p(x) = 0 \} = \{ x \in K' \mid p(x) \le 0 \}.$$
Therefore, the problem (GZOP) can be rewritten as


$$\min \{ f(x) \mid x \in K', \ p(x) \le 0 \}.$$


With a sufficiently large number $t$, it follows from Theorem 1 that the last
problem is equivalent to


$$\min \{ f(x) + t p(x) \mid x \in K' \}.$$
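
The effect of the penalty parameter can be checked by hand on a tiny instance. The Python sketch below assumes a hypothetical linear objective f(x) = -x_1 - x_2 and the toy polytope K' = {x in [0,1]^2 : x_1 + x_2 <= 1.5}; because f + t p is concave, its minimum over K' is attained at a vertex, so enumerating the five vertices is enough.

```python
def p(x):
    """Penalty p(x) = sum_i min(x_i, 1 - x_i); on [0,1]^n it is zero exactly at 0-1 points."""
    return sum(min(xi, 1.0 - xi) for xi in x)


def f(x):
    """Hypothetical linear objective f(x) = -x_1 - x_2."""
    return -x[0] - x[1]


# Vertices of the toy polytope K' = {x in [0,1]^2 : x_1 + x_2 <= 1.5}.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (0.5, 1.0)]


def penalized_minimizer(t):
    """Minimizer of the concave function f + t * p over K', searched among the vertices."""
    return min(VERTICES, key=lambda x: f(x) + t * p(x))


for t in (0.5, 2.0):
    x = penalized_minimizer(t)
    print(t, x, f(x) + t * p(x), p(x))
```

For this toy instance the fractional vertex (1, 0.5) has penalized value -1.5 + 0.5 t and the best binary vertex has value -1, so the penalized problem returns a binary point as soon as t exceeds 1, in line with the threshold behaviour stated in Theorem 1.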


3.2 DCA for CMINLP 

Now, let us return to the original problem (CMINLP).

Set $NV = |\bar{A}| \cdot |H|$ and $T = |\bar{A}| \cdot |H| + 2|A||K| + |OD||K| + 1 + |K| + 2|\bar{A}||K||H|$.

Let $D \subset \mathbb{R}^T$ be the set defined by (18)-(20), and let $D' = D \cap ([0,1]^{NV} \times \mathbb{R}^{T-NV})$.
Set


$$(1+\lambda) b^T u + \sum_{k \in K} p_k \Big\{ \gamma \sum_{a \in A} t_{0a} \left[ v_a^k + \delta y_a^k \right] + M \sum_{(r,s) \in OD} q^{rs,k} \Big\} + \lambda \Big( \eta + \frac{1}{1-\alpha} \sum_{k \in K} p_k z^k \Big)$$
$$= (1+\lambda) b^T u + \sum_{i=1}^{T-NV} a_i r_i = f(u,r) = f(x), \quad \text{with } x = (u,r).$$


Let $p_1(x) = \sum_{i=1}^{NV} \min \{ x_i, 1 - x_i \}$ and $p_2(x) = \sum_{i=1}^{NV} x_i (1 - x_i)$ be two functions
defined over $D'$.
Then, according to Theorem 1, the problem is equivalent to


$$\min \{ F(x) = f(x) + t p(x) : x \in D' \}$$


with a sufficiently large number $t$ and $p(x) = p_1(x)$ or $p(x) = p_2(x)$.
We have the DC decomposition


$$F(x) = g(x) - h(x)$$


where $g(x) = \chi_{D'}(x)$ and $h(x) = -f(x) - t p(x)$. Here $\chi_{D'}$ stands for the indicator
function of $D'$: $\chi_{D'}(x) = 0$ if $x \in D'$, $\chi_{D'}(x) = +\infty$ otherwise.

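The DC algorithm solves the problem as follows:

Initialization: Let $x^0 \in \mathbb{R}^T$ be a good guess, $k = 0$;

Repeat


• Compute $y^k \in \partial h(x^k) = \{ -f'(x^k) - t p'(x^k) \}$.
With $p(x) = p_1(x)$, we have


$$y_\ell^k = \begin{cases} t - (1+\lambda) b_\ell & \text{if } x_\ell^k \ge 0.5, \ \ell = 1, \ldots, NV, \\ -t - (1+\lambda) b_\ell & \text{if } x_\ell^k < 0.5, \ \ell = 1, \ldots, NV, \\ -a_{\ell - NV} & \text{if } \ell = NV+1, \ldots, T, \end{cases}$$


and with $p(x) = p_2(x)$, we have


$$y_\ell^k = \begin{cases} t (2 x_\ell^k - 1) - (1+\lambda) b_\ell & \text{if } \ell = 1, \ldots, NV, \\ -a_{\ell - NV} & \text{if } \ell = NV+1, \ldots, T. \end{cases}$$


• Take $x^{k+1} \in \partial g^*(y^k)$, i.e.


$$x^{k+1} \in \arg\min \{ g(x) - h(x^k) - \langle x - x^k, y^k \rangle \mid x \in D' \} \tag{CNLP}$$
$$\phantom{x^{k+1}} = \arg\min \{ -\langle x, y^k \rangle \mid x \in D' \}.$$
Until convergence of $x^k$.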
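
The first step of each iteration, the computation of a subgradient of h, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and argument layout are hypothetical, x is taken to hold the NV entries of u followed by the entries of r, and the coefficients of r are indexed from the start of the list a.

```python
def subgradient_h(x, b, a, t, lam, penalty="p1"):
    """One element of the subdifferential of h(x) = -f(x) - t * p(x) at x = (u, r).

    x : list with the NV entries of u followed by the entries of r
    b : retrofit costs (length NV); a : coefficients of r in f
    """
    nv = len(b)
    y = []
    for i in range(nv):
        if penalty == "p1":
            slope = t if x[i] >= 0.5 else -t  # from -t * min(x_i, 1 - x_i)
        elif penalty == "p2":
            slope = t * (2.0 * x[i] - 1.0)    # from -t * x_i * (1 - x_i)
        else:
            raise ValueError("penalty must be 'p1' or 'p2'")
        y.append(slope - (1.0 + lam) * b[i])
    y.extend(-ai for ai in a)                 # linear part of f in r
    return y


x = [1.0, 0.2, 3.0]  # hypothetical point: two entries of u, one entry of r
print(subgradient_h(x, b=[2.0, 4.0], a=[5.0], t=10.0, lam=1.0, penalty="p1"))
print(subgradient_h(x, b=[2.0, 4.0], a=[5.0], t=10.0, lam=1.0, penalty="p2"))
```

Only the first NV components depend on the penalty function and on t; the remaining components are constant across iterations, so each DCA step changes only the part of the linear objective that acts on u.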


Problem (CNLP) is a convex program with a linear objective function. It can be
solved with the CVX solver. Thus, instead of solving a discrete problem, we
solve a sequence of continuous problems to obtain a solution.


4 Experimental Results


Fig. 1. Nine-node network [11] 


We tested the proposed algorithm on the nine-node network shown in Fig. 1,
which is used in [11]. It consists of nine nodes ($|N| = 9$), 24 directional links
($|A| = 24$), and 72 O-D pairs ($|OD| = 72$). There are three bridges on the network,
labeled A, B, and C, each carrying traffic in both directions. These bridges are
susceptible to seismic disasters. Six links are directly affected because they
pass over the bridges, i.e. $\bar{A} = \{a_1, \ldots, a_6\} = \{5, 6, 11, 12, 21, 22\}$, $|\bar{A}| = 6$. Let
$K = \{1,2,3,4,5,6\}$; for each scenario $k \in K$, we randomly generated $p_k \in (0,1)$. We
consider five strategies, denoted $h_1$ to $h_5$, and randomly generated $\theta_a^{h,k} \in (0,1]$.
Table 1 reports the ratios for two scenarios.


Table 1. Some sample values of $\theta_a^{h,k}$ for fixed scenarios $k = 1, 2$.


Scenario $k = 1$
Link    | $h_1$ | $h_2$ | $h_3$ | $h_4$ | $h_5$
link 5  | 0.15  | 0.4   | 0.4   | 0.6   | 1
link 6  | 0.15  | 0.4   | 0.4   | 0.6   | 1
link 11 | 0.25  | 0.55  | 0.55  | 0.85  | 0.85
link 12 | 0.25  | 0.55  | 0.55  | 0.85  | 0.85
link 21 | 0.18  | 0.43  | 0.43  | 0.77  | 0.77
link 22 | 0.18  | 0.43  | 0.43  | 0.77  | 0.77

Scenario $k = 2$
Link    | $h_1$ | $h_2$ | $h_3$ | $h_4$ | $h_5$
link 5  | 0.03  | 0.4   | 0.3   | 0.4   | 0.9
link 6  | 0.03  | 0.4   | 0.3   | 0.4   | 0.9
link 11 | 0.4   | 0.4   | 0.4   | 0.65  | 0.65
link 12 | 0.4   | 0.4   | 0.4   | 0.65  | 0.65
link 21 | 0.07  | 0.23  | 0.23  | 0.57  | 0.57
link 22 | 0.07  | 0.23  | 0.23  | 0.57  | 0.57


Other input parameters related to the algorithm are given as follows:


$(c_a)_{1 \times 24} = 10^2 \times$ [10 12 14 16 14 12 16 15 12 14 18 12
14 13 14 16 10 14 18 12 14 18 16 14];


h — {hh hs hi hs hy 
(bj )1x30 = [ay ae , A, Ag 379° 5 GQ° 4° Ag yt , ag*| 


= 10's PS11L5295 15215995 05115925 
1.51152251511.522.5 1.5 12.5 2 2.5); 
$(d^{rs})_{1 \times 72}$ = [20 10 30 20 15 20 20 15 10 12 18 10 16 14 20 12 10 20 14 16 12 20
10 18 14 10 20 12 14 18 20 10 12 14 16 18 12 20 14 16 18 10 20 14 15 20 14
18 20 15 18 20 10 14 18 20 10 14 16 12 20 18 15 14 18 12 14 20 12 18 14 10];


$(t_{0a})_{1 \times 24}$ = [1 1.5 2 1.2 1.3 1.2 1.4 1.2 1.6 1.3 1.2 1.2
1 1.5 2 1.2 1.3 1.2 1.4 1.2 1.6 1.3 1.2 1.2];


$\phi_{\bar{A}} = [\phi_{a_1}, \ldots, \phi_{a_6}]$ = [10 12 10 8 14 12], M =10", $b_0$ = 15 x 10°.


Other parameters taken from [11] are $\lambda = 1$, $\alpha = 0.7$, $\delta = 0.15$, $\gamma$ = 10°.

We take two starting points: $x_0 = (u,r) = (1,0)$, and $x_1 = (u,r)$ defined as
follows: for every $a \in \bar{A}$, $u_a^h = 1$ for a single fixed strategy and $u_a^h = 0$ for
every other strategy $h$, and $r = 0$. The stopping condition of the algorithm is
$\|x^{k+1} - x^k\|_2 \le \epsilon$ with $\epsilon = 10^{-3}$. DCA is implemented in Matlab 2017b;
the problem has 3008 variables and 2221 constraints. We tested the nine-node
instances on a computer with 8 GB RAM and an Intel(R) Core(TM) i5-8400 @ 2.80 GHz
processor under Windows 10 Pro. The results are shown in Table 2.

A lower bound (LB) is calculated by solving a relaxation of the original
problem in which the integrality constraints are ignored. This value,
$64488 \times 10^7$, is used to compute GAP = (Obj. value - LB) $\times$ 100% / LB.
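
The GAP formula is easy to check numerically. The short Python sketch below (the function name is hypothetical) applies it to the three distinct objective values that appear in Table 2, all expressed in units of $10^7$ like the lower bound.

```python
LB = 64488  # lower bound, in units of 10^7 as in Table 2


def gap_percent(obj_value, lower_bound=LB):
    """GAP = (Obj. value - LB) * 100% / LB."""
    return (obj_value - lower_bound) * 100.0 / lower_bound


for obj in (65088, 68088, 71088):
    print(obj, round(gap_percent(obj), 4))
```

Because the lower bound comes from a relaxation, the GAP is an upper estimate of the distance to the true optimum: a value below 1% means the DCA solution is within 1% of optimal, possibly closer.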

Table 2. The results for different cases (P.f.: penalty function; Obj. value in
units of $10^7$; TCPU: total CPU time in seconds; Iter.: number of iterations)


Initial point $x_0$
t      | $p_1$: Obj. value | GAP%   | TCPU | Iter. | $p_2$: Obj. value | GAP%   | TCPU | Iter.
$10^0$ | 65088 | 0.9304 | 0.69 | 2 | 65088 | 0.9304 | 0.67 | 2
$10^1$ | 65088 | 0.9304 | 0.90 | 3 | 65088 | 0.9304 | 0.72 | 2
$10^2$ | 65088 | 0.9304 | 1.15 | 4 | 65088 | 0.9304 | 1.67 | 7
$10^3$ | 65088 | 0.9304 | 0.85 | 3 | 65088 | 0.9304 | 1.33 | 5
$10^4$ | 65088 | 0.9304 | 0.87 | 3 | 65088 | 0.9304 | 1.64 | 7
$10^5$ | 65088 | 0.9304 | 0.90 | 3 | 65088 | 0.9304 | 1.43 | 5
$10^6$ | 65088 | 0.9304 | 0.84 | 3 | 65088 | 0.9304 | 1.55 | 5
$10^7$ | 65088 | 0.9304 | 0.82 | 3 | 65088 | 0.9304 | 1.46 | 5
$10^8$ | 65088 | 0.9304 | 0.81 | 3 | 65088 | 0.9304 | 1.19 | 4

Initial point $x_1$
t      | $p_1$: Obj. value | GAP%   | TCPU | Iter. | $p_2$: Obj. value | GAP%   | TCPU | Iter.
$10^0$ | 65088 | 0.9304 | 0.68 | 2 | 65088 | 0.9304 | 0.66 | 2
$10^1$ | 65088 | 0.9304 | 0.72 | 2 | 65088 | 0.9304 | 0.72 | 2
$10^2$ | 65088 | 0.9304 | 0.67 | 2 | 65088 | 0.9304 | 1.05 | 4
$10^3$ | 65088 | 0.9304 | 0.67 | 2 | 65088 | 0.9304 | 0.66 | 2
$10^4$ | 65088 | 0.9304 | 0.70 | 2 | 65088 | 0.9304 | 0.71 | 2
$10^5$ | 68088 | 5.5824 | 0.95 | 3 | 68088 | 5.5824 | 0.93 | 3
$10^6$ | 71088 | 10.234 | 0.73 | 2 | 71088 | 10.234 | 0.73 | 2
$10^7$ | 71088 | 10.234 | 0.70 | 2 | 71088 | 10.234 | 0.71 | 2
$10^8$ | 71088 | 10.234 | 0.69 | 2 | 71088 | 10.234 | 0.67 | 2


From the results, we can see that:


- DCA always provides a feasible solution although it is a local algorithm and
works on the relaxed domain.

- The computing time is good. DCA needs about 1 second to solve a problem
with 3008 variables and 2221 constraints.

- The small GAPs (0.9304% in most runs) show that the solutions obtained by
DCA are good.

- The choice of penalty function has no visible impact on the objective value,
but the influence of the starting point is clear. The solutions obtained with
the starting point $x_0$ are stable: they do not depend on the penalty parameter,
whereas with $x_1$ the objective value increases for $t \ge 10^5$.


5 Conclusions 


In this paper, we proposed a new alternative method based on DC programming
and DCA for a transportation network protection problem. The exact penalty
technique is used to reformulate the original model and overcome the difficulties
due to integer variables. The proposed algorithm was tested on a small network
with a structure similar to the one used in [11]. The impact of the penalty
parameter, the penalty function and the starting point was reported. The results
show that the first starting point is better and that the algorithm is fast. In
future work, we may combine DCA with another method, for instance branch and
bound, to solve the problem globally. Experiments on larger-scale settings
should also be investigated.


Abstract. Train scheduling plays an important role in the operation of
railway systems. This work focuses on a scheduling model in which
one minimizes the total travel time of trains in a single-track railway
network. The model can be written as a mixed 0-1 linear
program, for which computing the optimal solution has exponential
worst-case complexity. In this paper, we propose a computationally efficient
approach to solve the train scheduling problem. Our approach is
based on the so-called Difference of Convex functions Algorithm (DCA) and
provides good feasible solutions with finite convergence. The algorithm is
tested on three different railway network topologies, including one topology
introduced in [18] and two practical topologies in Northern Vietnam.
The numerical results are encouraging and demonstrate the efficiency of
the approach.


Keywords: Train scheduling - Penalty function - DC Algorithm 


1 Introduction 


Railways have an advantage over roadways in that they can carry a large number
of passengers and large or heavy freight loads over long distances. Rail has
become an essential public transport mode in most countries. Among the many
problems arising in operating a railway system, train scheduling is critical to
reduce costs, increase profits or improve service quality. It generates train
timetables that optimize the total cost (time cost or financial cost) and satisfy
given conditions such as passenger demands, investment capital, time resources,
etc. Train scheduling is usually classified into two groups [18]: line planning
and schedule generation. The former determines frequencies, routes, and scheduled
times at each stop, while the latter finds the departure time and the arrival
time of each train at sidings or stations.

Szpigel (1973) can be considered a pioneer in studying train scheduling
problems. The author modeled the problem of minimizing the travel time of
trains on a single-track line as a job shop scheduling problem and then used a
branch-and-bound technique to solve it [31]. Afterwards, many researchers focused
on two aspects: modeling and solution methods for a given problem. For modeling,
practical problems were normally formulated as mathematical optimization
problems such as an integer program [2-6], a mixed-integer program
[1,7,10-15,18,25,26], or a multi-objective linear program [8,11]. Problems of this
kind are NP-hard, and finding a solution method (exact or heuristic) for them is
a challenging task. Branch-and-bound techniques were usually used to get the
optimal solution [14,16,31]. However, it takes much time to get the optimal
solution for large-scale and complex instances. Thus, heuristic approaches
were often proposed to find feasible solutions, for instance priority-rule-based
heuristics [1,9,19,28,28], backtracking search [1], look-ahead search [30],
and meta-heuristic algorithms [14,15,17].

Each of the solution methods above furnishes different feasible solutions, and
its efficiency depends on the structure of the formulation, the network topology,
the size of the instance, etc. It is therefore worthwhile to have a new solution
method that can find a good feasible solution. Even if this solution is not
optimal, it provides a good upper bound within a branch-and-bound scheme and so
reduces the time needed to compute the optimal solution. Our contribution is to
propose a method for finding such a good feasible solution. It is based on DC
(difference of two convex functions) programming and DCA (DC Algorithms), which
have been efficiently applied to real-world non-convex programs in various
fields [22,23].

Specifically, we study a typical model that was proposed by Higgins et al.
[14] and modified by Karoonsoontawong and Taptana [18]. It is formulated as
a mixed 0-1 linear program (MILP). By employing the exact
penalty method, we first show that the MILP can be equivalently recast as a
concave minimization problem. Next, we reformulate the concave minimization
problem as a DC program and then use DCA to solve it. We test the
proposed algorithm on three different railway network topologies, including one
topology introduced in [18] and two practical topologies in Northern Vietnam.
The preliminary results demonstrate the efficiency of the proposed method.
The paper is structured as follows. After the introduction, we describe the
problem in Sect. 2. Section 3 presents DC programming and DCA and shows how
to apply DCA to the problem. In Sect. 4, we provide some numerical experiments
to evaluate the proposed approach. The last section is dedicated to some
conclusions.


2 Problem Description 


Among train scheduling problems on single-track railway lines, the travel time
optimization problem has attracted particular attention. Higgins et al.
[14] and Zhou and Zhong [32] introduced mixed integer linear programs for this
problem. Based on the formulation in [14], Karoonsoontawong
and Taptana proposed a modified formulation in 2017 [18]. This section presents
the problem described in [18] under the following assumptions: the network includes
sidings or stations that divide the railway into segments; two or more trains are
not allowed on any track segment; there are two tracks at sidings/stations and
each segment has at most 2 tracks; and a pre-specified path is assigned to each
train.


2.1 Notation 


The sets, parameters and variables are denoted as follows.


Sets

$I = \{1,2,\ldots,n_I\}$ - set of trains; $n_I = |I|$, $n_I \in \mathbb{N}$, is the total number of
trains in the railway system;

$P = \{1,2,\ldots,n_P\}$ - set of rail segments; $n_P = |P|$, $n_P \in \mathbb{N}$, is the total number
of segments in the railway system;

$Q = \{1,2,\ldots,n_Q\}$ - set of stations or sidings; $n_Q = |Q|$, $n_Q \in \mathbb{N}$, is the number
of stations (or sidings) in the railway system;

$P(i)$ - ordered set of rail segments traversed by train $i \in I$;

$Q(i)$ - ordered set of stations (or sidings) traversed by train $i \in I$;

$P_1$ - set of single-track segments;

$P_2$ - set of double-track segments;

$P^{same\_d}(i,j)$ - set of common rail segments for trains $i$ and $j$, which traverse
them in the same direction;

$P^{opp\_d}(i,j)$ - set of common single-track segments for trains $i$ and $j$, which
traverse them in opposite directions;

$D$ - set of segment directions: inbound or outbound.


Parameters

$q_1(p,d)$ - starting station (or siding) of segment $p$ in direction $d$;

$q_2(p,d)$ - terminal station (or siding) of segment $p$ in direction $d$;

$d_{ip}$ - direction in which train $i \in I$ traverses segment $p \in P(i)$;

$h^{s}_{ijp}$ - minimum headway between trains $i,j \in I$ traversing in the same
direction on $p \in P$;

$h^{o}_{ijp}$ - minimum headway between trains $i,j \in I$ traversing in opposite
directions on $p \in P$;

$l_p$ - length of segment $p \in P$;

$Y_{iO}$ - earliest departure time of train $i \in I$;

$\underline{v}^i_p$ - minimum allowable average velocity of train $i \in I$ on segment $p \in P(i)$;

$\overline{v}^i_p$ - maximum achievable average velocity of train $i \in I$ on segment $p \in P(i)$;
$W_i$ - weight showing the priority of train $i \in I$;
$S_{iq}$ - scheduled stop time for train $i \in I$ at station $q \in Q(i)$;
$M$ - sufficiently big constant.


Decision Variables

$A_{ijp}$ equals 1 if train $i \in I$ traverses track segment $p \in P^{same\_d}(i,j)$ before
train $j \in I$ when trains $i$ and $j$ traverse track segment $p$ in the same direction;
0 otherwise.

$B_{ijp}$ equals 1 if train $i \in I$ traverses track segment $p \in P^{opp\_d}(i,j)$ before
train $j \in I$ when trains $i$ and $j$ traverse track segment $p$ in opposite directions; 0
otherwise.
$X^a_{iq}$ is the arrival time of train $i \in I$ at station/siding $q \in Q(i)$.

$X^d_{iq}$ is the departure time of train $i \in I$ from station/siding $q \in Q(i)$.
$X^d_{iO}$ is the departure time of train $i \in I$ from its origin station.

$X^a_{iD}$ is the arrival time of train $i \in I$ at its destination station.


2.2 Mathematical Model 


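The problem can be formulated as follows:


$$\min z = \sum_{i \in I} W_i \left( X^a_{iD} - Y_{iO} \right), \tag{1}$$


subject to:


$$M A_{ijp} + X^d_{i,q_1(p,d_{ip})} \ge X^a_{j,q_2(p,d_{jp})} + h^{s}_{ijp} \quad \forall i \ne j \in I, \ p \in P^{same\_d}(i,j) \tag{2}$$
$$M (1 - A_{ijp}) + X^d_{j,q_1(p,d_{jp})} \ge X^a_{i,q_2(p,d_{ip})} + h^{s}_{ijp} \quad \forall i \ne j \in I, \ p \in P^{same\_d}(i,j) \tag{3}$$
$$M B_{ijp} + X^d_{i,q_1(p,d_{ip})} \ge X^a_{j,q_2(p,d_{jp})} + h^{o}_{ijp} \quad \forall i \ne j \in I, \ p \in P^{opp\_d}(i,j) \tag{4}$$
$$M (1 - B_{ijp}) + X^d_{j,q_1(p,d_{jp})} \ge X^a_{i,q_2(p,d_{ip})} + h^{o}_{ijp} \quad \forall i \ne j \in I, \ p \in P^{opp\_d}(i,j) \tag{5}$$
$$\frac{l_p}{\overline{v}^i_p} \le X^a_{i,q_2(p,d_{ip})} - X^d_{i,q_1(p,d_{ip})} \le \frac{l_p}{\underline{v}^i_p} \quad \forall i \in I, \ p \in P(i) \tag{6}$$
$$X^d_{iO} \ge Y_{iO} \quad \forall i \in I \tag{7}$$
$$X^a_{iq} + S_{iq} \le X^d_{iq} \quad \forall i \in I, \ q \in Q(i) \tag{8}$$
$$A_{ijp} \in \{0,1\} \quad i,j \in I, \ p \in P^{same\_d}(i,j) \tag{9}$$
$$B_{ijp} \in \{0,1\} \quad i,j \in I, \ p \in P^{opp\_d}(i,j). \tag{10}$$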

The objective function (1) minimizes the weighted sum of the train travel
times. Constraints (2) and (3) state that for any two trains $i$, $j$ traversing the
same segment $p$ in the same direction, $A_{ijp}$ equals zero if and only if train $j$
traverses segment $p$ before train $i$, in which case train $j$ must have left segment
$p$ for at least the minimum headway before train $i$ can enter it. Constraints (4)
and (5) similarly indicate that $B_{ijp}$ equals zero if and only if train $j$ traverses
segment $p$ before train $i$, with a time headway not less than the minimum safety
headway. Constraints (6) ensure that the travel time of a train on any rail segment
lies between the corresponding lower and upper limits. Constraints (7) require the
departure time of a train from its origin station to be greater than or equal to
its earliest departure time. Constraints (8) state that a train leaves a station
or siding only after it has arrived there and stopped for at least the scheduled
stop time.
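
The role of the binary ordering variable in Constraints (2) and (3) can be illustrated with a short Python check. This is a sketch under assumptions rather than the exact constraints of [18]: it assumes the usual disjunctive big-M form in which the later train may enter the segment only after the earlier train has left it plus the headway, and the function name, the value of the constant and the time values are hypothetical.

```python
BIG_M = 1e6  # hypothetical value for the sufficiently big constant M


def same_direction_ok(A, dep_i, arr_i, dep_j, arr_j, headway):
    """Check the pair of big-M ordering constraints for one shared segment.

    dep_* / arr_* are the times at which a train enters / leaves the segment.
    A = 1 means train i goes first, A = 0 means train j goes first.
    """
    j_first = BIG_M * A + dep_i >= arr_j + headway        # binding when A == 0
    i_first = BIG_M * (1 - A) + dep_j >= arr_i + headway  # binding when A == 1
    return j_first and i_first


# Train i occupies the segment during [0, 10], train j during [15, 25], headway 3.
print(same_direction_ok(1, 0, 10, 15, 25, 3))
print(same_direction_ok(0, 0, 10, 15, 25, 3))
```

For each value of the binary variable exactly one of the two inequalities is binding and the other is switched off by the big constant, so a timetable with overlapping occupation intervals fails the check for both orderings.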

The problem above is a mixed 0-1 linear program. Finding a suitable
method for solving this kind of problem is always challenging. The challenge
comes not only from the binary variables but also from the size of the problems. We
propose here a method to solve the problem efficiently. To this end, we first use
the theory of exact penalization in DC programming [24] to reformulate the
MILP as the minimization of a DC function over a polyhedral convex set. The
resulting problem is then handled by DCA, which was introduced and extensively
developed over the last decades [23]. This approach has been applied
successfully to several large-scale problems (see [20-23, 27] and references therein).
The details are provided in the following section.


3 Solution Method 


3.1 DC Reformulation 


By using an exact penalty result, we reformulate the MILP as a concave
minimization program. The exact penalty technique aims at transforming the
original MILP into a more tractable equivalent problem in the DC optimization
framework. Let $S$ be the set of points satisfying the linear constraints (2)-(8)
of the MILP (1)-(10). For notational simplicity, we group all arrival and
departure time variables in a column vector $U = [U_1, \ldots, U_{2 n_I n_Q}]^T$, where
$T$ denotes the transpose operator and the entries of $U$ are the variables $X^a_{iq}$
and $X^d_{iq}$, $\forall i \in I, q \in Q$.
In the same way, we group all the binary variables ($A_{ijp}$ and $B_{ijp}$)
into a column vector $c = [c_{111}, \ldots, c_{(2 n_I) n_I n_P}]^T$,
where $c_{ijp} = A_{ijp}$ and $c_{(n_I + i)jp} = B_{ijp}$ $\forall i,j \in I, p \in P$. We define the set
$K := \{(U,c) \in S : c \in [0,1]^{2 n_I n_I n_P}\}$ and assume that $K$ is a nonempty bounded
polyhedral convex set in $\mathbb{R}^{2 n_I n_Q} \times \mathbb{R}^{2 n_I n_I n_P}$.
Therefore, the problem (1)-(10) can be expressed in the general form


$$(U_{opt}, c_{opt}) = \arg\min \{ z : (U,c) \in S, \ c \in \{0,1\}^{2 n_I n_I n_P} \}, \tag{11}$$
where $z = \sum_{i \in I} W_i (X^a_{iD} - Y_{iO})$.
Let us consider the function $p$ defined by
$$p(U,c) := \sum_{i,j \in I; \, p \in P} \min \{ c_{ijp}, 1 - c_{ijp} \}. \tag{12}$$


It is clear that $p$ is concave and finite on $K$, $p(U,c) \ge 0$ $\forall (U,c) \in K$, and
$\{(U,c) \in S : c \in \{0,1\}^{2 n_I n_I n_P}\} = \{(U,c) \in K : p(U,c) \le 0\}$. Hence, Problem (11)
can be written as


$$(U_{opt}, c_{opt}) = \arg\min \{ z : (U,c) \in K, \ p(U,c) \le 0 \}. \tag{13}$$
We can then apply the following theorem.


Theorem 1. Let $K$ be a nonempty bounded polyhedral convex set, $f$ be a finite
concave function on $K$ and $p$ be a finite nonnegative concave function on $K$.
Then there exists $t_0 \ge 0$ such that for $t > t_0$ the following problems have the same
optimal value and the same optimal solution set:
$$(P_t) \quad \alpha(t) = \min \{ f(x) + t p(x) : x \in K \}$$
$$(P) \quad \alpha = \min \{ f(x) : x \in K, \ p(x) \le 0 \}.$$


Furthermore,


• If the vertex set of $K$, denoted by $V(K)$, is contained in $\{x \in K : p(x) \le 0\}$,
then $t_0 = 0$.
• If $p(x) > 0$ for some $x \in V(K)$, then
$t_0 = \min \left\{ \frac{f(x) - \alpha(0)}{S_0} : x \in K, \ p(x) \le 0 \right\}$,
where $S_0 = \min \{ p(x) : x \in V(K), \ p(x) > 0 \}$.
Proof. The proof for the general case can be found in [24].


From Theorem 1 we get, for a sufficiently large number $t$ ($t > t_0$), the
concave minimization problem equivalent to (13)


$$\min \{ z + t p(U,c) : (U,c) \in K \} \tag{14}$$


which is a DC program of the form


$$\min \{ g(U,c) - h(U,c) \} \tag{15}$$


where $g(U,c) = \chi_K(U,c)$ and $h(U,c) = -z - t \sum_{i,j \in I; \, p \in P} \min \{ c_{ijp}, 1 - c_{ijp} \}$.
Here $\chi_K$ is the indicator function of $K$: $\chi_K(U,c) = 0$ if $(U,c) \in K$, and $+\infty$ otherwise.

We have thus transformed an optimization problem with integer variables
into an equivalent problem with continuous variables.


3.2. DC Algorithm for (14) 


Now, we investigate a DC programming approach for solving (14). A DC program
is a problem of the form:


$$\alpha := \min \{ f(x) = g(x) - h(x) : x \in \mathbb{R}^n \} \tag{16}$$


with $g$, $h$ being lower semi-continuous proper convex functions on $\mathbb{R}^n$, and its
dual problem is defined as


$$\alpha := \min \{ h^*(y) - g^*(y) : y \in \mathbb{R}^n \} \tag{17}$$


where $g^*(y) := \max \{ x^T y - g(x) : x \in \mathbb{R}^n \}$ is the conjugate function of $g$.

Based on local optimality conditions and duality in DC programming, DCA
consists in constructing two sequences $\{x^k\}$ and $\{y^k\}$, candidates
to be optimal solutions of the primal and dual programs respectively, in such a
way that $\{g(x^k) - h(x^k)\}$ and $\{h^*(y^k) - g^*(y^k)\}$ are decreasing and their limit
points satisfy the local optimality conditions. The idea of DCA is simple: each
iteration of DCA approximates the concave part $-h$ by its affine majorization
(which corresponds to taking $y^k \in \partial h(x^k)$) and minimizes the resulting convex
function.
Generic DCA scheme:
Initialization: Let $x^0 \in \mathbb{R}^n$ be a good guess, $k \leftarrow 0$;
Repeat
Calculate $y^k \in \partial h(x^k)$;
Calculate $x^{k+1} \in \arg\min \{ g(x) - h(x^k) - \langle x - x^k, y^k \rangle : x \in \mathbb{R}^n \}$ $(P_k)$;
$k \leftarrow k + 1$;
Until convergence of $x^k$.
The convergence properties of DCA and its theoretical basis can be found in
[23]. For instance, it is important to mention that:
• DCA is a descent method without line search;
• If the optimal value of problem (16) is finite and the sequence $\{x^k\}$ is
bounded, then every limit point $x^*$ of $\{x^k\}$ is a critical point of $g - h$;
• DCA has linear convergence for general DC programs;
• DCA has finite convergence for polyhedral DC programs ((16) is called a
polyhedral DC program if either $g$ or $h$ is polyhedral convex).
We now describe the DCA applied to the DC program (14). By definition, a
sub-gradient $(v^k, d^k) \in \partial h(U^k, c^k)$ can be chosen as follows:
• $v^k_{iq} = -W_i$ if $q$ is the destination station on the travel path of train $i$, otherwise 0;
• $d^k_{ijp} = t$ if $c^k_{ijp} > 0.5$, otherwise $d^k_{ijp} = -t$, for all $i,j \in I$, $p \in P$.
By using $(v^k, d^k)$, we then compute $(U^{k+1}, c^{k+1})$ by solving the linear
program:


$$(U^{k+1}, c^{k+1}) = \arg\min \{ \chi_K(U,c) - \langle (U,c) - (U^k,c^k), (v^k,d^k) \rangle : (U,c) \in K \} \tag{18}$$
$$\phantom{(U^{k+1}, c^{k+1})} = \arg\min \{ -\langle (U,c), (v^k,d^k) \rangle : (U,c) \in K \} \tag{19}$$


Thus, the DCA applied to (14) is as follows:
Algorithm DCA:

Let k = 0; er = 10°;

Choose a sufficiently small positive number $\epsilon$;

Choose an initial point $(U^0, c^0)$;

while $er > \epsilon$ do
Compute $(v^k, d^k) \in \partial h(U^k, c^k)$;
Solve $\min \{ -\langle (U,c), (v^k,d^k) \rangle : (U,c) \in K \}$ to obtain $(U^{k+1}, c^{k+1})$;


Compute error $er = \| (U^{k+1}, c^{k+1}) - (U^k, c^k) \|$; $k = k + 1$;
endwhile
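
The loop above can be traced on a toy instance. The following Python sketch is not the authors' C++ implementation: it assumes a hypothetical two-variable problem with only binary-type variables c, a hypothetical linear objective z(c) = -c_1 - c_2 and the polytope {c in [0,1]^2 : c_1 + c_2 <= 1.5}, and it replaces the LP solver by an enumeration of the five vertices.

```python
# Toy feasible set K: vertices of {c in [0,1]^2 : c_1 + c_2 <= 1.5}.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.5), (0.5, 1.0)]
W = (-1.0, -1.0)  # hypothetical linear objective z(c) = W . c


def subgradient(c, t):
    """Element of the subdifferential of h(c) = -z(c) - t * sum_i min(c_i, 1 - c_i)."""
    return tuple(-w + (t if ci > 0.5 else -t) for w, ci in zip(W, c))


def solve_lp(d):
    """min -<c, d> over K; enumerating the vertices stands in for an LP solver."""
    return min(VERTICES, key=lambda c: -sum(ci * di for ci, di in zip(c, d)))


def dca(c0, t, eps=1e-9, max_iter=50):
    """DCA loop: subgradient step, linear program, stop when the iterate no longer moves."""
    c = tuple(c0)
    for _ in range(max_iter):
        c_next = solve_lp(subgradient(c, t))
        err = sum((a - b) ** 2 for a, b in zip(c_next, c)) ** 0.5
        c = c_next
        if err <= eps:
            break
    return c


print(dca((1.0, 0.5), t=2.0))
print(dca((0.0, 0.0), t=2.0))
print(dca((1.0, 0.5), t=0.5))
```

With t = 2 the fractional start (1, 0.5) moves to the binary vertex (1, 0), while the binary start (0, 0) stays where it is even though its objective value is worse, which shows the dependence of DCA on the initial point. With t = 0.5 the iterate remains at the fractional vertex, illustrating that the penalty parameter must be large enough.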
Regarding the complexity of the proposed DCA, besides the computation of
the sub-gradients, which is trivial, the algorithm requires solving one linear
program at each iteration, and it has finite convergence. The linear program has
polynomial complexity. The convergence of Algorithm DCA can be summarized in the
next theorem [29].
Theorem 2. (i) Algorithm DCA generates a sequence $\{(U^k, c^k)\}$ contained in
$V(K)$ such that the sequence $\{g(U^k, c^k) - h(U^k, c^k)\}$ is decreasing.
(ii) If at iteration $r$ we have $c^r \in \{0,1\}^{2 n_I n_I n_P}$, then $c^k \in \{0,1\}^{2 n_I n_I n_P}$ and
$z(U^{k+1}) \le z(U^k)$ for all $k \ge r$.
(iii) The sequence $\{(U^k, c^k)\}$ converges to $(U^*, c^*) \in V(K)$ after a finite number
of iterations. The point $(U^*, c^*)$ is a critical point of problem (14).
Moreover, such a $(U^*, c^*)$ is almost always a strict local minimum of
problem (14).


Based on the second statement of the theorem above, we should choose
a feasible solution as the initial point of DCA. This ensures that the solution
obtained by DCA is feasible for the original problem although DCA works on the
continuous domain. We choose the initial point as follows. We arrange
the trains in order of their earliest departure times at the origin stations of
their itineraries. For each train in turn, we determine its schedule over the
whole itinerary. The schedule of the next train is built on the schedules
already created for the previous trains, so that no conflict occurs. The
result is a feasible solution of the problem.


4 Computational Experiments 


In this section, we provide preliminary computational results for our approach. We
coded the algorithm in C++ and tested the instances on a PC with an Intel Core i7
3770 3.4 GHz processor and 16 GB RAM. The solver CPLEX 12.6.1
is used to solve the linear program in each iteration of DCA and to get the
optimal value of the MILP. We investigate the algorithm's performance on three
network topologies: the topology shown in [18] (toy network) and two topologies
in Northern Vietnam (HN-HP and HN-LC networks). The toy network includes 3
trains, 5 segments and 6 stations, and the total length of the segments is 297 km. The
HN-HP network is a single-track railway system connecting Hanoi capital and
Hai Phong city, including 8 trains (4 inbound, 4 outbound), 7 segments and 8
stations. The total length of the HN-HP network is 102 km. The HN-LC network is
a line connecting Hanoi capital and Lao Cai province. It is more complex as it
consists of 8 trains (4 inbound and 4 outbound), 27 segments and 28 stations, and
its total distance is 294 km. The sizes of the networks are shown in Table 1.


Table 1. The size of testing networks


                     | Toy network | HN-HP network | HN-LC network
Trains               | 3           | 8             | 8
Segments             | 5           | 7             | 27
Stations/sidings     | 6           | 8             | 28
Continuous variables | 36          | 128           | 448
Binary variables     | 90          | 896           | 3456
Constraints          | 216         | 1920          | 7360

In each network, 10 instances are generated by modifying the train parameters,
namely the scheduled stop time at stations, the earliest departure time and
the weight of the trains. The results for the three networks are presented in
Tables 2, 3 and 4, where ValDCA is the objective value obtained by DCA, CPU is
the computing time, OptVal is the optimal value, and

$$\mathrm{GAP} = \frac{\mathrm{ValDCA} - \mathrm{OptVal}}{\mathrm{OptVal}} \times 100\%.$$
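
As a quick check of the GAP definition, the following small sketch recomputes the value for the first Toy-network instance (ValDCA = 2347.022, OptVal = 2190.053). The function name `gap_percent` is ours, chosen only for illustration.

```python
def gap_percent(val_dca, opt_val):
    """Relative gap (in percent) between the DCA value and the optimal value."""
    return (val_dca - opt_val) / opt_val * 100.0


# First instance of the Toy network, values taken from Table 2.
print(round(gap_percent(2347.022, 2190.053), 2))
```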


Table 2. The computation result for Toy network 


Instance | ValDCA   | CPU   | OptVal   | GAP
1        | 2347.022 | 0.011 | 2190.053 | 7.17%
2        | 2319.022 | 0.018 | 2174.437 | 6.65%
3        | 2199.714 | 0.016 | 2199.714 | 0.00%
4        | 2175.438 | 0.021 | 2143.438 | 1.49%
5        | 2281.022 | 0.019 | 2136.438 | 6.77%
6        | 2175.715 | 0.014 | 2175.715 | 0.00%
7        | 2164.438 | 0.022 | 2136.438 | 1.31%
8        | 2223.715 | 0.016 | 2223.715 | 0.00%
9        | 4420.691 | 0.018 | 4420.691 | 0.00%
10       | 4370.568 | 0.021 | 4224.106 | 3.47%
Average  | 2667.735 | 0.018 | 2602.475 | 2.69%


From the tables of results, we can see that:


- The computing time of DCA is short for all instances. Even for the largest
  network (HN-LC, with 3456 binary variables and 7360 constraints), the time
  needed to obtain a solution remains small.

- The GAP values are small (2.69%, 0.54% and 2.56% on average for the Toy,
  HN-HP and HN-LC networks, respectively), which means that the solution
  quality is good. For the Toy network, GAP equals zero in 4 out of 10
  instances, i.e. DCA finds the optimal solution in 40% of the instances.


Table 3. The computation result for HN-HP network


Instance | ValDCA | CPU   | OptVal | GAP
1        | 6755   | 0.077 | 6745   | 0.15%
2        | 6749   | 0.061 | 6739   | 0.15%
3        | 5493   | 0.075 | 5455   | 0.70%
4        | 5482   | 0.076 | 5442   | 0.74%
5        | 5447   | 0.067 | 5445   | 0.04%
6        | 5283   | 0.081 | 5273   | 0.19%
7        | 5340   | 0.068 | 5301   | 0.74%
8        | 17014  | 0.071 | 16744  | 1.61%
9        | 16722  | 0.063 | 16659  | 0.38%
10       | 10087  | 0.077 | 10015  | 0.72%
Average  | 8437.2 | 0.072 | 8381.8 | 0.54%


Table 4. The computation result for HN-LC network 


Instance | ValDCA  | CPU   | OptVal  | GAP
1        | 4546    | 0.385 | 4470    | 1.70%
2        | 4543    | 0.381 | 4505    | 0.84%
3        | 4618    | 0.342 | 4528    | 1.99%
4        | 4956    | 0.357 | 4687    | 5.74%
5        | 4996    | 0.386 | 4683    | 6.68%
6        | 4546    | 0.356 | 4470    | 1.70%
7        | 7311    | 0.335 | 7150    | 2.25%
8        | 11322   | 0.379 | 11137   | 1.66%
9        | 6008    | 0.373 | 5914    | 1.59%
10       | 6054    | 0.335 | 5969    | 1.42%
Average  | 5890.00 | 0.363 | 5751.30 | 2.56%


5 Conclusion 


In this paper, we have studied a model minimizing the total travel time of
trains in single-track networks. We have shown that this problem can be
formulated as a mixed-integer linear program. Given the inherent difficulty of
computing optimal solutions of MILPs, our main contribution is a
computationally efficient approach based on DCA. The combinatorial optimization
problem is first reformulated as a DC program with a natural choice of DC
decomposition, and the resulting DCA then consists in solving a finite sequence
of linear programs. A distinctive feature of DCA here is that it yields an
integer solution although it works on a continuous domain. Preliminary
numerical results are encouraging and demonstrate the effectiveness of the
proposed method. The short computing time and the capacity for handling
large-scale instances make the proposed method valuable. Moreover, since most
problem formulations arising in train scheduling can be cast as MILPs, the
proposed approach seems attractive and deserves further investigation.


Abstract. We study an approach based on DC (Difference of Convex functions)
programming and DCA (DC Algorithm) for the convex piecewise-linear fitting
problem. The objective is to fit a given set of data points by a convex
piecewise-linear function. The problem is formulated as minimizing the squared
$\ell_2$-norm fitting error, and then reformulated as a DC program to which a
standard DCA scheme is applied. Furthermore, a modified DCA scheme with
successive DC decomposition is proposed, with the aim of improving DCA by
updating the convex approximation of the fitting error function during DCA
iterations. These DCAs consist in solving a sequence of convex quadratic
programs. Moreover, the modified DCA retains the convergence properties of the
standard DCA. Numerical results on synthetic/real datasets show the efficiency
of our methods compared with the existing approaches.


Keywords: DC programming - DCA - DCA with successive DC 
decomposition - Convex piecewise-linear fitting 


1 Introduction 


The problem of fitting a set of data points by a certain function has been
studied extensively. Our work focuses on fitting a given set of m points
$(x_i, y_i) \in \mathbb{R}^n \times \mathbb{R}$ by a convex piecewise-linear
continuous function $f : \mathbb{R}^n \to \mathbb{R}$ with K affine functions
(K > 1), of the general form
$f(x) = \max_{j=1,\dots,K} \{\langle a_j, x\rangle + b_j\}$ with
$(a_j, b_j) \in \mathbb{R}^n \times \mathbb{R}$, $j = 1,\dots,K$. Considering
the least-squares criterion, this fitting problem aims to find the vectors
$(a_j, b_j)$ such that the mean-square error (MSE)
$\frac{1}{m}\sum_{i=1}^m [f(x_i) - y_i]^2$ is as small as possible (see, e.g.,
[13]). It can be formulated as the following squared $\ell_2$-norm optimization
problem:

$$\min \sum_{i=1}^m \Big( \max_{j=1,\dots,K} \{\langle a_j, x_i\rangle + b_j\} - y_i \Big)^2 \qquad (1)$$
$$\text{s.t. } a = (a_1, b_1, a_2, b_2, \dots, a_K, b_K) \in \mathbb{R}^{K(n+1)}.$$

It is easy to see that the problem (1) is, in general, not convex. Hence
solving this problem globally is difficult in the large-scale setting.
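
The nonconvexity of (1) can be seen on a tiny example. The sketch below is an illustration of ours, not taken from the paper: with a single data point $(x, y) = (0, 0)$ and K = 2, two parameter vectors with zero error have a midpoint with strictly positive error, which a convex objective would not allow. The helper `fit_error` and the sample values are hypothetical.

```python
def fit_error(params, data):
    """Objective of (1): sum of squared errors of the max-affine function.

    params: list of (a_j, b_j), where a_j is a list of n slopes.
    data:   list of (x_i, y_i), where x_i is a list of n coordinates.
    """
    total = 0.0
    for x, y in data:
        f_x = max(sum(c * xc for c, xc in zip(a_j, x)) + b_j for a_j, b_j in params)
        total += (f_x - y) ** 2
    return total


data = [([0.0], 0.0)]                          # one point, n = 1
A = [([0.0], -2.0), ([0.0], 0.0)]              # f(0) = max(-2, 0) = 0
B = [([0.0], 0.0), ([0.0], -2.0)]              # f(0) = max(0, -2) = 0
midpoint = [([0.0], -1.0), ([0.0], -1.0)]      # (A + B) / 2, f(0) = -1

# Convexity would require F(midpoint) <= (F(A) + F(B)) / 2.
violates_convexity = fit_error(midpoint, data) > 0.5 * (fit_error(A, data) + fit_error(B, data))
```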

This problem has many applications ranging from mathematical model- 
ing/optimization, to econometrics, transportation and forecasting, to statistics, 
machine learning and data mining (see, e.g., [1,5,13,14,19,20] and references 
therein). In machine learning, developing regression techniques using piecewise- 
linear functions plays an important role for practical problems such as energy 
storage optimization with a solar source, beer brewery optimization, customer 
demand forecasting (see, e.g., [4,7,20] for more practical problems). 


Related Works. Previous and relevant works on piecewise-linear fitting
functions are reviewed in, e.g., [1-4,7,13,18-20]. In most of these works, the
attention is on constructing a piecewise-linear function in which the number
of affine functions is progressively increased to obtain a better fit. The
recent work of Balazs [4] showed that for a class of sub-Gaussian fitting
problems, there exists a near-optimal convex piecewise-linear fitting function
with at most $[m^{n/(n+4)}]$ affine functions. In contrast, our work focuses on
searching among convex piecewise-linear functions with a fixed number K of
affine functions. Works in the same direction can be found, for example, in
[6,13,19]. Several works for special convex piecewise-linear functions with the
$\ell_\infty$/$\ell_1$-norm-based fitting criterion were mentioned in [13].
When K = m, Boyd and Vandenberghe [6] reformulated (1) as a convex quadratic
program with m(n + 1) variables and m(m - 1) linear inequality constraints,
which is impractical for medium/large datasets. Magnani and Boyd [13] proposed
a fast, heuristic Gauss-Newton method, named Least-squares partition algorithm
(LSPA). Its idea is the same as that of the K-means algorithm for clustering:
it partitions the set $\{x_i\}_{i=1,\dots,m}$ into K subsets based on the
chosen K centroids, then fits an affine function to each subset by solving a
linear least-squares problem, next updates the K centroids, and repeats until
there is no change in the partitioning. LSPA depends on the starting points and
is not guaranteed to converge, even for small datasets [7,13]. On the other
hand, Toriello and Vielma [19] investigated an exact approach, named MIQPM, for
(1). The authors used the big-M technique to reformulate (1) as a mixed-integer
quadratic program containing K(n + 1) continuous variables, mK binary
variables and more than m(2K + 1) linear constraints. Solving this
mixed-integer quadratic program becomes computationally expensive due to the
huge number of binary variables; moreover, how large the big-M value should be
to balance the quality of the fitting error against the computing speed is
still an open question [18].

Overcoming these disadvantages of both heuristic and exact approaches
motivates us to develop efficient algorithms for solving the nonconvex,
nonsmooth fitting problem. Our algorithms combine the advantages of LSPA and
MIQPM: they require solving a convex quadratic program at each iteration, like
LSPA, and they are still guaranteed to converge (quite often to globally
optimal solutions in practice), like MIQPM. The backbone of our approach is DC
(Difference of Convex functions) programming and DCA (DC Algorithm) (see, e.g.,
[11,12,15,17] and the references in [9,12]), which are well known as powerful
nonsmooth, nonconvex optimization tools. DCA aims to solve a standard DC
program that consists in minimizing a DC function F = G - H (with G, H being
convex functions) over a convex set or on the whole space. Here G - H is called
a DC decomposition of F, while G and H are DC components of F. The idea of the
standard DCA is, at each iteration k, to approximate the second DC component H
by its affine minorant $H_k$ and then to solve the resulting convex
subproblem. In other words, one approximates the DC function F by the convex
majorization $F^k := G - H_k$. It is clear that the closer the function $F^k$
is to F, the better DCA could be. Le Thi and Pham Dinh [12] have recently
introduced a modified version of the standard DCA scheme, named DCA with
successive DC decomposition. In this version, the DC decomposition of F is
successively updated during DCA iterations in order to better approximate the
DC function F, i.e. $F = G^k - H^k$ at each iteration k, and its corresponding
convex majorization is $F^k := G^k - H^k_k$. Our approach based on DCA with
successive DC decomposition is similar to the recent work [10] on reinforcement
learning problems. In a related work using DC optimization, Bagirov et al. [3]
considered the continuous piecewise-linear fitting problem over the class of
maxima of minima of linear functions (which includes convex piecewise-linear
functions). The authors indicated that the objective function is a DC function
without giving explicit DC components, and designed an algorithm based on the
subdifferentials of the DC components. However, the computation of these
subdifferentials stated in Proposition 2 of their work is not correct.

In this paper, we address the convex piecewise-linear fitting problem (1) by
an approach based on DC programming and DCA. In particular, we formulate (1) as
a DC program for which a standard DCA scheme is developed. Its DC decomposition
is designed by using the affine minorization of each convex fitting function at
an arbitrary given point. Changing this point at each iteration of DCA leads to
a modified DCA version with successive DC decomposition for the problem (1).
The resulting subproblem in these DCAs can be equivalently reformulated as a
convex quadratic program with 2m + K(n + 1) variables and m(K + 1) linear
inequality constraints. Notably, the modified DCA retains the convergence
properties of the standard DCA: it is a descent algorithm with global
convergence (i.e. it always converges from an arbitrary starting point). We
provide several numerical experiments of our DCA-based algorithms on various
synthetic/real datasets, in comparison with the heuristic/exact algorithms
LSPA and MIQPM for solving the problem (1).

The rest of the paper is organized as follows. How to apply DC programming 
and DCA to the considered problem is shown in Sect. 2. Section 3 presents the 
numerical results on benchmark datasets. Finally, Sect. 4 concludes the paper. 


2 Solution Method by DC Programming and DCA 


DC programming and DCA were introduced by Pham Dinh Tao in a preliminary form
in 1985 and have been extensively developed by Le Thi Hoai An and Pham Dinh Tao
since 1994. DCA is well known as an efficient approach in the nonconvex
programming framework. In recent years, numerous DCA-based algorithms have been
developed for successfully solving large-scale nonsmooth/nonconvex programs in
several application areas (see the list of references in [9,12]). For a
comprehensive survey on thirty years of development of DCA, the reader is
referred to the recent work [12].
The idea of DCA relies on the DC structure of the objective function F. The
standard DCA scheme is described below.
Standard DCA scheme
Initialization: Let $a^0 \in \mathbb{R}^p$ be a best guess. Set k = 0.
repeat
  1. Calculate $\beta^k \in \partial H(a^k)$.
  2. Calculate $a^{k+1} \in \arg\min\{G(a) - H(a^k) - \langle a - a^k, \beta^k\rangle : a \in \mathbb{R}^p\}$  $(P_k)$.
  3. k = k + 1.
until convergence of $\{a^k\}$.
Convergence properties of the standard DCA are described completely in
[11,15,16].
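
To make the scheme concrete, here is a minimal sketch on a one-dimensional toy problem of our own choosing (not the fitting problem of this paper): $F(a) = a^4 - a^2$ with $G(a) = a^4$ and $H(a) = a^2$. Step 1 gives $\beta^k = 2a^k$, and the convex subproblem of step 2, $\min_a a^4 - \beta^k a$, has the closed-form solution $a = \sqrt[3]{\beta^k/4}$. The function name `dca_quartic` is hypothetical.

```python
import math


def dca_quartic(a0, iterations=60):
    """Standard DCA on F(a) = a**4 - a**2 with G(a) = a**4 and H(a) = a**2."""
    a = a0
    history = [a ** 4 - a ** 2]  # objective values F(a^k)
    for _ in range(iterations):
        beta = 2.0 * a  # step 1: beta^k = H'(a^k)
        # step 2: minimise G(a) - beta * a, i.e. solve 4 a^3 = beta
        a = math.copysign((abs(beta) / 4.0) ** (1.0 / 3.0), beta)
        history.append(a ** 4 - a ** 2)
    return a, history


a_final, values = dca_quartic(2.0)
```

Starting from $a^0 = 2$, the iterates approach $1/\sqrt{2}$, a minimizer of F, and the recorded objective values never increase, which illustrates the descent property of DCA on this toy instance.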


2.1 DCA for Solving the Problem (1) 


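The convex piecewise-linear fitting problem (1) can be rewritten as

$$\min \Big\{ F(a) = \sum_{i=1}^m [p_i(a)]^2 : a \in \mathbb{R}^{K(n+1)} \Big\} \qquad (2)$$

where, for $i = 1,\dots,m$, the function
$p_i : \mathbb{R}^{K(n+1)} \to \mathbb{R}$ is defined by

$$p_i(a) = \max_{j=1,\dots,K} \{\langle a, z^{(i,j)}\rangle - y_i\},$$

and for $i = 1,\dots,m$, $j = 1,\dots,K$, the vector
$z^{(i,j)} = (z^{(i,j)}_1, \dots, z^{(i,j)}_K) \in \mathbb{R}^{K(n+1)}$ with
$z^{(i,j)}_r = (x_i, 1)$ if $r = j$, $(0, 0)$ if $r \neq j$. Obviously, $p_i$
is a convex piecewise-linear function.

It is known from, e.g., [15] that if $p_i$ is a DC function with a nonnegative
DC decomposition then $[p_i]^2$ is DC too. Thus, using the convexity of the
function $p_i$, we highlight a nonnegative DC decomposition of $p_i$ (see
[10]). In particular, we define the affine minorization of $p_i$ at an
arbitrary point $\bar{a}$ $(i = 1,\dots,m)$ as follows:

$$l_i(a) = p_i(\bar{a}) + \langle a - \bar{a}, \bar{\gamma}_i\rangle = \langle a, \bar{\gamma}_i\rangle - y_i \quad \text{with } \bar{\gamma}_i \in \partial p_i(\bar{a}).$$

Let $(l_i)^- := \max\{0, -l_i\}$. Obviously, the functions $p_i + (l_i)^-$ and
$(l_i)^-$ are nonnegative and convex on $\mathbb{R}^{K(n+1)}$, and so are
$[p_i + (l_i)^-]^2$ and $[(l_i)^-]^2$. Then we have the nonnegative DC
decomposition of $p_i$ as follows:

$$p_i = [p_i + (l_i)^-] - (l_i)^-.$$

It leads to a DC decomposition of $[p_i]^2$, that is,

$$[p_i]^2 = 2\big\{ [p_i + (l_i)^-]^2 + [(l_i)^-]^2 \big\} - [p_i + 2(l_i)^-]^2,$$

which follows from the identity $(u - v)^2 = 2(u^2 + v^2) - (u + v)^2$ with
$u = p_i + (l_i)^-$ and $v = (l_i)^-$.

As a result, F is a DC function. We thus derive the DC formulation of (2) as
follows:

$$\min\{ F(a) := G(a) - H(a) : a \in \mathbb{R}^{K(n+1)} \} \qquad (3)$$

where

$$G = \sum_{i=1}^m 2\big\{ [p_i + (l_i)^-]^2 + [(l_i)^-]^2 \big\}, \qquad H = \sum_{i=1}^m [p_i + 2(l_i)^-]^2.$$

Applying the standard DCA scheme to (3) leads us, at iteration k, to compute a
subgradient $\beta^k \in \partial H(a^k)$ and a solution $a^{k+1}$ to the
following convex program:

$$\min\{ G(a) - \langle \beta^k, a\rangle : a \in \mathbb{R}^{K(n+1)} \}.$$

The last program can be reformulated as

$$\min \sum_{i=1}^m (2\tau_i^2) + \sum_{i=1}^m (2t_i^2) - \langle \beta^k, a\rangle$$
$$\text{s.t. } t_i \geq (l_i)^-(a), \quad \tau_i \geq t_i + p_i(a), \quad i = 1,\dots,m,$$

which is in fact a convex quadratic program of the form

$$\min \sum_{i=1}^m (2\tau_i^2) + \sum_{i=1}^m (2t_i^2) - \langle \beta^k, a\rangle \qquad (4)$$
$$\text{s.t. } t_i \geq 0, \quad t_i \geq y_i - \langle a, \bar{\gamma}_i\rangle, \quad i = 1,\dots,m,$$
$$\tau_i \geq t_i + \langle a, z^{(i,j)}\rangle - y_i, \quad i = 1,\dots,m, \; j = 1,\dots,K.$$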


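Compute the Subdifferential $\partial H$: We have

$$\partial H(a) = \sum_{i=1}^m \partial [p_i(a) + 2(l_i)^-(a)]^2,$$

$$\partial (l_i)^-(a) = \{0\} \text{ if } l_i(a) > 0, \quad [0, -\bar{\gamma}_i] \text{ if } l_i(a) = 0, \quad \{-\bar{\gamma}_i\} \text{ otherwise},$$

$$\partial p_i(a) = \partial \Big[ \max_{j=1,\dots,K} \{\langle a, z^{(i,j)}\rangle - y_i\} \Big] = \mathrm{co}\{ z^{(i,j)} : j \in I_i(a) \}, \text{ and}$$

$$I_i(a) := \arg\max_{j=1,\dots,K} \langle a, z^{(i,j)}\rangle.$$

Here $[0, -\bar{\gamma}_i]$ is the line segment between 0 and
$-\bar{\gamma}_i$, and co denotes the convex hull of a set of points. Hence we
take $\bar{\gamma}_i \in \partial p_i(\bar{a})$,
$\gamma_i^k \in \partial p_i(a^k)$ and $\beta^k \in \partial H(a^k)$ as
follows: $\bar{j}_i \in I_i(\bar{a})$, $j_i^k \in I_i(a^k)$,
$\bar{\gamma}_i = z^{(i,\bar{j}_i)}$, $\gamma_i^k = z^{(i,j_i^k)}$,

$$\beta^k = 2 \sum_{i=1}^m \big[ p_i(a^k) + 2(l_i)^-(a^k) \big] \big( \gamma_i^k - 2\bar{\gamma}_i \mathbb{1}_{\{l_i(a) < 0\}}(a^k) \big). \qquad (5)$$

Here the function $\mathbb{1}_A(a) = 1$ if $a \in A$, 0 otherwise.
Finally, DCA applied to (3) is summarized in Algorithm 1.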


Algorithm 1. Standard DCA for solving (3) (DCA) 


Initialization: Let $\varepsilon$ be a sufficiently small positive number. Let $a^0 \in \mathbb{R}^{K(n+1)}$.
Set k = 0.
repeat
  1. Compute $\beta^k \in \partial H(a^k)$ using (5).
  2. Solve the convex quadratic program (4) to obtain $(a^{k+1}, t^{k+1}, \tau^{k+1})$.
  3. k = k + 1.
until Stopping criteria are satisfied.
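
Step 1 of Algorithm 1 can be sketched in a few lines. The code below is an illustration under our own conventions, not the authors' implementation: the vector a is stored as a list of K pairs $(a_j, b_j)$, so that $\langle a, z^{(i,j)}\rangle = \langle a_j, x_i\rangle + b_j$, and the subgradient is returned block by block. The function names and the data layout are hypothetical; ties in the argmax are broken by taking the first maximizer.

```python
def affine_values(params, x):
    """Values <a_j, x> + b_j for all j; params is a list of (a_j, b_j)."""
    return [sum(c * xc for c, xc in zip(a_j, x)) + b_j for a_j, b_j in params]


def beta_standard(a_k, a_bar, X, y):
    """Subgradient beta^k of H at a_k following formula (5).

    a_bar is the fixed point at which the affine minorants l_i are built.
    Returns K blocks of length n + 1 (slope part followed by intercept part).
    """
    K, n = len(a_k), len(X[0])
    beta = [[0.0] * (n + 1) for _ in range(K)]
    for x, y_i in zip(X, y):
        vals_bar = affine_values(a_bar, x)
        j_bar = max(range(K), key=vals_bar.__getitem__)  # index defining gamma_bar_i
        vals_k = affine_values(a_k, x)
        j_k = max(range(K), key=vals_k.__getitem__)      # index defining gamma_i^k
        p = vals_k[j_k] - y_i          # p_i(a^k)
        l = vals_k[j_bar] - y_i        # l_i(a^k) = <a^k, gamma_bar_i> - y_i
        l_minus = max(0.0, -l)
        w = 2.0 * (p + 2.0 * l_minus)  # scalar factor in (5)
        z = list(x) + [1.0]            # nonzero block (x_i, 1) of z^{(i,j)}
        for d in range(n + 1):
            beta[j_k][d] += w * z[d]
            if l < 0:
                beta[j_bar][d] -= 2.0 * w * z[d]
    return beta
```

When `a_bar` is taken equal to `a_k`, the result reduces to $2\sum_i p_i(a^k) z^{(i,j_i^k)}$, which is a convenient sanity check of the implementation.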


Theorem 1. Convergence properties of DCA


(i) DCA generates the sequence $\{a^k\}$ such that the sequence $\{F(a^k)\}$
is decreasing.

(ii) Every limit point $a^*$ of the sequence $\{a^k\}$ is a critical point of
G - H, i.e. $\partial G(a^*) \cap \partial H(a^*) \neq \emptyset$.


2.2 DCA with Successive DC Decomposition for Solving 
the Problem (1) 


From DCA for solving (3), we see that the affine minorization $l_i$ of $p_i$
is easily computed at any point $\bar{a}$, and the closer the function $l_i$ is
to $p_i$, the better DCA could be. Hence, we suggest a modified DCA with
successive DC decomposition for the problem (1) (see [10]), i.e. $l_i$ is
updated during DCA iterations by choosing $\bar{a} = a^k$ at iteration k. In
particular, at iteration k, we set
$\gamma_i^k = z^{(i,j_i^k)} \in \partial p_i(a^k)$,

$$l_i^k(a) = p_i(a^k) + \langle a - a^k, \gamma_i^k\rangle = \langle a, \gamma_i^k\rangle - y_i.$$

The resulting DC formulation of (2) at iteration k takes the form

$$\min\{ F(a) := G^k(a) - H^k(a) : a \in \mathbb{R}^{K(n+1)} \} \qquad (6)$$

where

$$G^k = \sum_{i=1}^m 2\big\{ [p_i + (l_i^k)^-]^2 + [(l_i^k)^-]^2 \big\}, \qquad H^k = \sum_{i=1}^m [p_i + 2(l_i^k)^-]^2.$$
DCA with successive DC decomposition applied to (6), named DCA*, needs to
compute two sequences $\{\beta^k\}$ and $\{a^k\}$ such that
$\beta^k \in \partial H^k(a^k)$ is calculated as

$$\beta^k = 2 \sum_{i=1}^m p_i(a^k)\, z^{(i,j_i^k)} \quad \text{where } j_i^k \in I_i(a^k), \qquad (7)$$

and $a^{k+1}$ is an optimal solution to the convex quadratic program

$$\min \sum_{i=1}^m (2\tau_i^2) + \sum_{i=1}^m (2t_i^2) - \langle \beta^k, a\rangle \qquad (8)$$
$$\text{s.t. } t_i \geq 0, \quad t_i \geq y_i - \langle a, z^{(i,j_i^k)}\rangle, \quad i = 1,\dots,m,$$
$$\tau_i \geq t_i + \langle a, z^{(i,j)}\rangle - y_i, \quad i = 1,\dots,m, \; j = 1,\dots,K.$$


Algorithm 2. DCA with successive DC decomposition for solving (6) (DCA*)


Initialization: Let $\varepsilon$ be a sufficiently small positive number. Let $a^0 \in \mathbb{R}^{K(n+1)}$.
Set k = 0.
repeat
  1. Compute $\beta^k \in \partial H^k(a^k)$ using (7).
  2. Solve the convex quadratic program (8) to obtain $(a^{k+1}, t^{k+1}, \tau^{k+1})$.
  3. k = k + 1.
until Stopping criteria are satisfied.

Notably, the convergence properties of DCA remain valid for DCA*, as stated in
Theorem 2. Its proof is similar to the proof of Theorem 2 in the recent work
[10].


Theorem 2. Convergence properties of DCA*


(i) DCA* generates the sequence $\{a^k\}$ such that the sequence $\{F(a^k)\}$
is decreasing.

(ii) Every limit point $a^*$ of the sequence $\{a^k\}$ is a critical point of
$F = G^\infty - H^\infty$, where the functions

$$G^\infty = \sum_{i=1}^m 2\big\{ [p_i + (l_i^\infty)^-]^2 + [(l_i^\infty)^-]^2 \big\}, \qquad H^\infty = \sum_{i=1}^m [p_i + 2(l_i^\infty)^-]^2,$$

and $l_i^\infty$ is the affine minorization of $p_i$ at $a^*$, $i = 1,\dots,m$.
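
The compact form of the subgradient used by DCA* in (7) can be checked term by term. The short derivation below is a sketch of ours: it assumes that formula (5) is applied with $\bar{a} = a^k$, so that $l_i^k(a^k) = p_i(a^k)$ and $\bar{\gamma}_i = \gamma_i^k$, and it shows that each summand reduces to $2\,p_i(a^k)\,\gamma_i^k$ whatever the sign of $p_i(a^k)$.

```latex
\begin{aligned}
p_i(a^k) \ge 0 &: \quad (l_i^k)^-(a^k) = 0,
  && 2\,[\,p_i(a^k) + 0\,]\,\gamma_i^k = 2\,p_i(a^k)\,\gamma_i^k, \\
p_i(a^k) < 0 &: \quad (l_i^k)^-(a^k) = -p_i(a^k),
  && 2\,[\,p_i(a^k) - 2\,p_i(a^k)\,]\,(\gamma_i^k - 2\gamma_i^k)
     = 2\,p_i(a^k)\,\gamma_i^k .
\end{aligned}
```

Summing over $i$ gives $\beta^k = 2\sum_{i=1}^m p_i(a^k)\,\gamma_i^k$ with $\gamma_i^k = z^{(i,j_i^k)}$; in other words, $\beta^k$ is assembled from the current residuals and the currently active affine pieces only.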


2.3 Starting Point for DCA 


We suggest a deterministic strategy to compute the starting point for our DCAs
that is quite similar to the random version in [13]. It consists of three main
steps: (Step 1) select K points from $\{x_i\}_{i=1,\dots,m}$ by the KKZ method
[8], except that the first point is set as
$\bar{x} = \frac{1}{m}\sum_{i=1}^m x_i$; (Step 2) find
$j_i^0 \in \{1,\dots,K\}$ $(i = 1,\dots,m)$ such that the $j_i^0$-th point
among these K points is the closest to the point $x_i$ (see Voronoi partitions
in [13] for more details); (Step 3) compute the starting point
$a^0 = (a_1^0, b_1^0, \dots, a_K^0, b_K^0)$ where, for $j = 1,\dots,K$,
$(a_j^0, b_j^0)$ is an optimal solution to the linear least-squares problem
$\min_{(a_j,b_j)} \sum_{i \in I(j)} (\langle a_j, x_i\rangle + b_j - y_i)^2$,
where $I(j) = \{ i \in \{1,\dots,m\} : j_i^0 = j \}$.


Table 1. Descriptions of six synthetic datasets: n is the dimension of the
space; $\mathcal{X}$ is the set of points $x_i$ with the size of m; q(x) is the
function used to generate $y_i = q(x_i)$; $K := [m^{n/(n+4)}]$ is the number of
affine functions.


Dataset Description 

Datal n=3,m=729, K =17, X = {0,...,8}°, g(x) = 0.01(exp(a1) + 0.543 + a3) 
Data2 ([4])|n € {4,6, 8}, m € {1000, 1500, 2000}, K = [mn/(r+4)], 

The points x; are generated from the uniform distribution U/({—2, 2]”), 

yi = 9%) +, ~N(O,1), a(x) = maxjai,....n43{(Vw(X;), x) + w(X)}, 
w(x) = (1/2)x! Hx + (f,x) — n, f = (1/n,2/n,...,1) € R®, 

H,,; = (14+ 2i)/(2%) if = 3, (¢+9)7} otherwise, 


Xj =e; — (k/d)1 for j = 1,...,n; ¥n41 = 0, Xn42 = —1, Xn43 = 1, 
e; is the j-th unit vector in R”, 1 is the vector of ones in R” 
Data3 Data2 with $w(x) = (1/2)[(x)_+]^\top H (x)_+$
Data4 Data2 with m = 2000, $\varepsilon_i = 0$
Data5 n = 5, m = 1024, K = 47, $\mathcal{X} = \{-2,-1,1,2\}^5$, $q(x) = (1/2)x^\top H x + \langle f, x\rangle - n$
Data6 Datad with ¥ = {0,1,2,3}5, q(x) = 0.01(a1 + 0.542 + #3)? — v4 + (a5)? 
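
Steps 1 and 2 of the starting-point strategy of Sect. 2.3 can be sketched as follows. This is a hedged illustration, not the authors' code: we assume that the KKZ rule of [8] amounts to the farthest-point (maximin) selection, with the first point replaced by the mean as described in the text, and the function names are hypothetical. Step 3 (one linear least-squares fit per group) is left to any least-squares routine.

```python
import math


def kkz_centers(points, K):
    """Step 1: K centers; the first is the mean, the others are chosen by the
    farthest-point rule (our reading of the KKZ method)."""
    m, n = len(points), len(points[0])
    mean = tuple(sum(p[d] for p in points) / m for d in range(n))
    centers = [mean]
    # Distance from each point to its nearest center chosen so far.
    nearest = [math.dist(p, mean) for p in points]
    while len(centers) < K:
        idx = max(range(m), key=nearest.__getitem__)  # farthest point from all centers
        centers.append(tuple(points[idx]))
        nearest = [min(nearest[i], math.dist(points[i], centers[-1])) for i in range(m)]
    return centers


def assign_to_nearest(points, centers):
    """Step 2: index of the closest center for each point (Voronoi partition)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = kkz_centers(pts, 2)
labels = assign_to_nearest(pts, centers)
```

Note that a group I(j) may turn out to be empty or too small for a well-posed least-squares fit; how such cases are handled is not specified in the text and would need an extra rule.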


3 Numerical Experiments 


In this section, we compare DCA and DCA* with two heuristic/exact algorithms,
LSPA [13] and MIQPM [18,19], on six synthetic datasets similar to those taken
from the literature (see, e.g., [4,7,13]) and on six real, large regression
datasets in different areas from the UCI Machine Learning Repository^1 and the
LIBSVM website^2. The detailed descriptions of the synthetic/real datasets are
summarized in Tables 1 and 3.


Comparison Criteria: We consider the following criteria: the root-mean-square
(RMS) error defined as
$\mathrm{RMS}(a) := \big( \frac{1}{m}\sum_{i=1}^m [p_i(a)]^2 \big)^{1/2}$, the
CPU time (in seconds), and the number of the so-called active affine functions,
where a function $\langle a_j, x\rangle + b_j$ in the obtained function
$f(x) = \max_{j=1,\dots,K} \{\langle a_j, x\rangle + b_j\}$ is active if it is
the maximum at some point $x_i \in \mathcal{X}$.
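
The first and third criteria can be computed together, as in the following minimal sketch (the function name and the data layout are ours). When several affine functions attain the maximum at a data point, only the first maximizer is counted as active here, which is one possible convention.

```python
import math


def rms_and_active(params, X, y):
    """Return (RMS error, number of active affine functions).

    params: list of (a_j, b_j), a_j being a list of n slopes.
    X, y:   data points x_i (lists of n coordinates) and targets y_i.
    """
    K = len(params)
    active = set()
    squared = 0.0
    for x, y_i in zip(X, y):
        vals = [sum(c * xc for c, xc in zip(a_j, x)) + b_j for a_j, b_j in params]
        j_star = max(range(K), key=vals.__getitem__)  # piece attaining the maximum at x_i
        active.add(j_star)
        squared += (vals[j_star] - y_i) ** 2          # [p_i(a)]^2
    return math.sqrt(squared / len(X)), len(active)
```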


Set Up Experiments: All experiments were run in MATLAB R2016b on a PC with an
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60 GHz and 32 GB of RAM. We use CPLEX 12.8
for solving the convex quadratic programs in DCA and DCA*, and the
mixed-integer quadratic programs in MIQPM. The MATLAB function lsqlin was used
for solving the linear least-squares problems in LSPA. The big-M of MIQPM is
set to 1000 [1,18] and the maximum execution time for a call to CPLEX to
1000 s. As LSPA is fast but very sensitive to the starting point, and its
convergence is not ensured [13], the final RMS error of LSPA is reported as the
best result found over N runs with random starting points, and its CPU time is
the total execution time of LSPA over these runs. The maximum number of
iterations of LSPA for each run and the value of N are set to 50 and 200,
respectively (see [13]). As for the DCAs, the starting point is described in
Sect. 2.3 and the stopping criterion is
$|\mathrm{RMS}(a^k) - \mathrm{RMS}(a^{k-1})| < \varepsilon (1 + \mathrm{RMS}(a^{k-1}))$.
The default tolerance $\varepsilon$ is set to 10~° for Data1, Data6; 10~? for
the others. The point $\bar{a}$ is set to $a^0$ in DCA. As for the real
datasets, we divide each dataset into two subsets: a training set containing
75% of the dataset, and a test set containing the remaining 25%. First, we
learn a PL model on the training set and collect its RMS error. Then, we use
that PL model on the test set to compute the RMS error (named Test RMS).


1 http://www.ics.uci.edu/~mlearn/MLRepository.html.
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.


Fig. 1. The RMS error of DCA, DCA* and LSPA versus the CPU time on Data4:
n = 8, m = 2000.
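
The stopping criterion of the DCAs is a relative test on two consecutive RMS errors. A minimal sketch is given below; the function name is ours, and the tolerance value used in the usage line is purely illustrative, not the one used in the experiments.

```python
def should_stop(rms_new, rms_old, eps):
    """Stopping test |RMS(a^k) - RMS(a^{k-1})| < eps * (1 + RMS(a^{k-1}))."""
    return abs(rms_new - rms_old) < eps * (1.0 + rms_old)


# Illustrative tolerance only (hypothetical value).
stop_now = should_stop(1.0, 1.0005, 1e-3)
```

The factor $1 + \mathrm{RMS}(a^{k-1})$ makes the test behave like a relative criterion when the error is large and like an absolute one when the error is close to zero.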


Descriptions of Results' Table, Figures: The comparative results of all four
algorithms on the synthetic datasets, in terms of the RMS error, the CPU time
and the number of active affine functions, are reported in Table 2. Moreover,
we plot the RMS error (resp. the best RMS error) of DCA, DCA* (resp. LSPA)
versus the CPU time on Data4 (n = 8, m = 2000) in Fig. 1. As for the real
datasets, we report several results of DCA* and LSPA in Table 3.


Comments on Computational Results 


• In terms of the RMS error, both DCA and DCA* are efficient and compare
favorably with the other algorithms on the synthetic datasets. Indeed,
the RMS error of the standard DCA is better than that of LSPA and MIQPM,
with a ratio of gain from 0.69% to 99.0% in 22/24 cases and from 27.0%
to 98.9% in all cases, respectively. Meanwhile, DCA* obtains the best
fitting error on all datasets: the ratio of gain of DCA* versus DCA, LSPA,
and MIQPM varies from 1.01% to 66.7%, from 7.42% to 89.9%, and from
29.1% to 99.1%, respectively, for Data1, Data2, Data3, and Data5. In
particular, for Data4 and Data6, DCA* seems to furnish the globally optimal
solution to the problem (1) (with the RMS error less than 10~° after about
200 s, see Fig. 1), and the ratio of gain increases significantly: it varies
from 82.7% to 97.1%, from 85.3% to 99.8%, and around 99.8%, respectively.
This demonstrates the advantage of updating the DC decomposition in DCA*.


Table 3. Comparative results of DCA* and LSPA on six real datasets.


Real dataset      | m     | n  | K   | Test RMS            | Training CPU (s)
                  |       |    |     | DCA*     | LSPA     | DCA*     | LSPA
winequality-red   | 1599  | 11 | 224 | 5.14e+00 | 7.08e+00 | 7.09e+01 | 8.34e+01
space_ga          | 3107  | 8  | 125 | 1.19e-01 | 1.80e-01 | 5.01e+01 | 5.82e+01
abalone           | 4177  | 8  | 259 | 2.17e+00 | 2.51e+00 | 8.39e+01 | 2.02e+02
winequality-white | 4898  | 11 | 508 | 7.16e-01 | 1.85e+01 | 2.84e+02 | 5.12e+02
CCPP              | 9568  | 4  | 98  | 4.16e+00 | 7.89e+00 | 1.58e+02 | 1.35e+02
cadata            | 20640 | 8  | 500 | 2.76e+06 | 3.00e+06 | 8.40e+02 | 1.32e+03


In our experiments, the best RMS error of LSPA improves only moderately over
the runs (see Fig. 1), while MIQPM fails on most of the medium/large datasets
because it runs out of memory or exceeds the time limit (1000 s). Moreover,
a smaller stopping tolerance $\varepsilon$ yields a slightly better RMS error
for our algorithms.

• Concerning the CPU time, DCA* is the fastest in most cases: the ratio of
gain of DCA* versus DCA, LSPA, and MIQPM varies from 1.05 to 2.86 times
in 21/24 cases, from 1.01 to 5.47 times in 21/24 cases, and from 11.8 to 165
times in all cases, respectively. DCA runs slower than LSPA, by a factor of
1.00 to 4.37, in 12/24 cases, in particular on the large datasets. Updating
the DC decomposition thus improves the speed of DCA. Note that the CPU time of
the four algorithms depends on the fixed number K of affine functions. From
Table 2, the number of active affine functions of DCA and DCA* is smaller
than the fixed value of K and than that of LSPA in most cases. This
observation leads us to a perspective: how to update the number of affine
functions in our DCAs to improve the speed while still ensuring the quality of
the fitting error.

• As for the real datasets, we observe that the RMS error of DCA* is
significantly better than that of LSPA on all test sets: the ratio of gain in
terms of Test RMS varies from 8% to 96.1%, specifically 96.1%, 47.28% and 8%
on the large datasets of size 4898, 9568 and 20640, respectively. Meanwhile,
the CPU time of DCA* is reasonable (less than 900 s on the large datasets) and
DCA* is faster than LSPA on 5/6 datasets.
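
The "ratio of gain" is not defined explicitly in the text. The sketch below uses the relative improvement (LSPA - DCA*) / LSPA, an assumption of ours that reproduces the three percentages quoted above from the Test RMS values of Table 3. The function name is hypothetical.

```python
def gain_percent(reference, ours):
    """Relative improvement (in percent) of `ours` over `reference`.
    Assumed definition, inferred from the reported figures."""
    return (reference - ours) / reference * 100.0


# Test RMS of LSPA and DCA* taken from Table 3.
gains = {
    "winequality-white": gain_percent(1.85e+01, 7.16e-01),
    "CCPP": gain_percent(7.89e+00, 4.16e+00),
    "cadata": gain_percent(3.00e+06, 2.76e+06),
}
```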


4 Conclusions 


We have investigated DC programming and DCA for solving the convex
piecewise-linear fitting problem. Two DCA schemes, DCA and DCA*, have been
developed. Unlike in the standard DCA, the DC decompositions in DCA* are
updated during DCA iterations to better approximate the DC objective function.
Numerical results on six synthetic and six real datasets show that the standard
DCA is efficient and that DCA* improves both the quality and the speed of DCA.
DCA* is thus suggested as the best among the related existing approaches for
this problem. Given these promising results, we plan in future works to update
the number of affine functions in our DCA schemes, to explore other DC
decompositions, and to extend our DC programming and DCA-based approach to more
general fitting problems with DC fitting functions and different fitting
criteria.


Abstract. Unlike conventional omnidirectional sensors, which always have an
omni-angle of detecting range, directional sensors may have a limited angle of
detecting range because of technical constraints or cost considerations. A
directional sensor network consists of a number of directional sensors, which
can switch to several directions to extend their detecting ability so as to
cover all the targets in a given area. Power conservation is still a
significant issue in such directional sensor networks. In this paper, we
consider the multiple directional cover sets (MDCS) problem of organizing the
directions of sensors into a group of non-disjoint cover sets to extend the
network lifetime. It is an NP-complete problem. Firstly, a new model of MDCS is
introduced in the form of Mixed Binary Integer Linear Programming (MBILP).
Secondly, we investigate a new method based on DC programming and DC algorithm
(DCA) for solving MDCS. Numerical results are presented to demonstrate the
performance of the algorithm.


Keywords: MDCS - Mixed Binary Integer Linear Programming 
(MBILP) - DC programming - DC algorithm (DCA) 


1 Introduction 


In recent years, sensor networks have emerged as promising platforms for many
applications, such as environmental monitoring, battlefield surveillance, and
health care [2,3]. A sensor network may consist of a large number of sensor
nodes, each composed of detecting, data processing, and communicating
components. Conventional research on sensor networks is usually based on the
assumption of omnidirectional sensors, which have an omni-angle of detecting
range. However, sensors may have a limited angle of detecting range because of
technical constraints or cost considerations; such sensors are called
directional sensors.

Video sensors [4,5], ultrasonic sensors [6], and infrared sensors [3] are
examples of widely used directional sensors. Note that the directional
characteristic we discuss in this paper concerns the detecting activity, not
the communicating activity, of sensor nodes. There are several ways to extend
the detecting ability of directional sensors. One way is to put several
directional sensors of the same kind on one sensor node, each facing a
different direction. Another way is to equip the sensor node with a mobile
device that allows the node to move around. The third way is to equip the
sensor node with a device that enables the sensor on the node to switch (or
rotate) to different directions, so that one sensor can cover several
directions. In this paper, we consider the scenario in which every sensor node
is equipped with exactly one sensor. A set of targets with given locations is
deployed in a two-dimensional Euclidean plane. A number of directional sensors
are randomly scattered near these targets. We assume that the detecting region
of each direction of a directional sensor is a sector of the detecting disk
centered at the sensor, with a given detecting radius. All sensors have
detecting regions of the same size, and the detecting regions of different
directions of a sensor do not overlap.

When the sensors are randomly deployed, each sensor initially faces one of its
directions. These sensors form a directional sensor network so that data can be
gathered and transferred to the sink, a central processing base station. If a
directional sensor faces a direction, we say that the sensor works in this
direction and that the direction is the work direction of the sensor. When a
sensor works in a direction and a target is in the detecting region of the
sensor, we say that this direction of the sensor covers the target. Since a
directional sensor has a smaller angle of detecting range than an
omnidirectional sensor, and may even cover no target when it is deployed, we
have to schedule the sensors in the network to face specific directions so as
to cover all the targets. We call a subset of directions of the sensors that
covers all the targets a cover set. Note that a cover set can contain at most
one direction of each sensor. The problem of finding a cover set, called the
directional cover set (DCS) problem, is NP-complete (see [1]).
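
The two conditions defining a cover set (all targets covered, at most one direction per sensor) can be checked mechanically. The sketch below is an illustration with a data layout of our own: directions are pairs (sensor, direction) and `covers` maps each direction to the set of targets it covers. All names are hypothetical.

```python
def is_cover_set(chosen, covers, targets):
    """Check whether `chosen` is a cover set.

    chosen:  iterable of (sensor, direction) pairs.
    covers:  dict mapping (sensor, direction) to the set of targets it covers.
    targets: collection of all targets.
    """
    chosen = list(chosen)
    sensors = [sensor for sensor, _ in chosen]
    if len(sensors) != len(set(sensors)):  # a sensor may contribute at most one direction
        return False
    covered = set()
    for direction in chosen:
        covered |= covers.get(direction, set())
    return set(targets) <= covered


covers = {(1, 1): {"a1", "a2"}, (1, 2): {"a3"}, (2, 1): {"a3"}}
targets = {"a1", "a2", "a3"}
```

Verifying a candidate is easy; the difficulty of the DCS problem lies in finding such a set among all combinations of directions.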

Power conservation is still a significant issue in directional sensor networks
for the following reasons. First, most sensors have limited power sources and
are non-rechargeable. Moreover, the batteries of the sensors are difficult to
replace because of hostile or inaccessible environments in many scenarios. We
assume that every sensor is non-rechargeable and dies when it runs out of
energy. To conserve energy, we can keep the necessary sensors in the active
state and put the redundant sensors into the sleep state, while keeping all
the targets covered.

In this paper, our objective is to maximize the network lifetime of a
directional sensor network, where the network lifetime is defined as the time
duration during which each target is covered by the work direction of at least
one active sensor. The approach is to organize the directions of sensors into
non-disjoint subsets, each of which is a cover set, and to allocate the work
time for each cover set.

Note that non-disjoint cover sets allow a direction or a sensor to take part
in different cover sets. Only one cover set is active at any time, and the
cover sets are activated alternately. When one cover set is active, every
sensor that has a direction in this cover set is in the active state and works
in this direction, while the rest of the sensors are in the sleep state.

The problem of finding non-disjoint cover sets and allocating the work time
for each of them to maximize the network lifetime is called the multiple
directional cover sets (MDCS) problem. It is also NP-complete (see [1]).
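
To illustrate why non-disjoint cover sets can extend the lifetime, here is a small Python sketch under stated assumptions: the function name `schedule_lifetime`, the data layout and the three-sensor example are hypothetical, and the sets passed in are assumed to be valid cover sets (this is not checked).

```python
def schedule_lifetime(cover_sets, work_times, lifetimes):
    """Return the total work time of a schedule, or None if a sensor is overused.

    cover_sets: list of sets of (sensor, direction) pairs, assumed to be cover sets
    work_times: work time t_k allocated to each cover set
    lifetimes:  dict sensor -> initial lifetime
    """
    used = {i: 0.0 for i in lifetimes}
    for cover_set, t in zip(cover_sets, work_times):
        for (i, _j) in cover_set:
            used[i] += t  # sensor i is active while this cover set is active
    if any(used[i] > lifetimes[i] + 1e-12 for i in lifetimes):
        return None
    return sum(work_times)


# Assumed example: three sensors with lifetime 1, any two of them form a cover set.
lifetimes = {1: 1.0, 2: 1.0, 3: 1.0}
cover_sets = [{(1, 1), (2, 1)}, {(2, 2), (3, 1)}, {(1, 2), (3, 2)}]
total = schedule_lifetime(cover_sets, [0.5, 0.5, 0.5], lifetimes)
```

In this toy case the three overlapping cover sets run for 0.5 each, giving a lifetime of 1.5, whereas any family of disjoint cover sets could contain only one set and would stop at 1.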

The main contributions of this paper are as follows. Section 2 introduces
the problem statement and a new mathematical model of MDCS. The solution
methods, based on DC programming and DCA and on DCA combined with a cutting
plane method, are presented in Sects. 3 and 4, while the numerical simulation
and conclusion are reported in Sect. 5.


2 Problem Statement and Mathematical Modeling 


In this section, the notations and the problem statements of the DCS and the
MDCS are presented. The following notations are used in this paper (Fig. 1).


— $M$: the number of targets,

— $N$: the number of sensors,

— $W$: the number of directions per sensor,

— $a_m$: the $m$-th target, $1 \le m \le M$,

— $s_i$: the $i$-th sensor, $1 \le i \le N$,

— $d_{ij}$: the $j$-th direction of the $i$-th sensor, $1 \le i \le N$, $1 \le j \le W$,

— $d_{ij} = \{a_m \mid a_m \in A,\ a_m \text{ is covered by } d_{ij}\}$,

— $s_i = \{d_{ij} \mid j = 1, \dots, W\}$. So, if $a_m \in d_{ij}$ then $a_m$ is covered by $d_{ij}$,

— $A = \{a_1, a_2, \dots, a_M\}$: the set of targets,

— $S = \{s_1, s_2, \dots, s_N\}$: the set of sensors,

— $D = \{d_{ij} \mid i = 1, \dots, N,\ j = 1, \dots, W\}$,

— $L_i$: the lifetime of a sensor $s_i$, which is the time duration when the
sensor is in the active state all the time.


Fig. 1. An example of directional sensor network ([1]). 


The problem statement is presented in [1] as follows.

Definition 1. Cover Set: Given a collection $D$ of subsets of a finite set $A$
and a collection $S$ of subsets of $D$, a cover set for $A$ is a subset
$D' \subseteq D$ such that every element in $A$ belongs to at least one member
of $D'$ and every two elements in $D'$ cannot belong to the same member of $S$.


Definition 2. DCS Problem: Given a collection $D$ of subsets of a finite set
$A$ and a collection $S$ of subsets of $D$, find a cover set for $A$.


Definition 3. MDCS Problem: Given a collection $D$ of subsets of a finite
set $A$ and a collection $S$ of subsets of $D$, find a family of $K$ cover sets
$D_1, D_2, \dots, D_K \subseteq D$ for $A$, with nonnegative weights
$t_1, t_2, \dots, t_K$, such that $t_1 + t_2 + \dots + t_K$ is maximized and,
for each $s \in S$, $\sum_{k=1}^{K} |s \cap D_k|\, t_k \le L$, where $L$ is a
given positive number.


Note that $|s \cap D_k|$ is the number of directions of $s$ that are in $D_k$,
where $|s \cap D_k| = 0$ or $1$ since no more than one direction of a sensor
can work in a cover set.

In [1] the MDCS problem is formulated as a mixed 0-1 integer program with
quadratic constraints. In this paper, a new model of MDCS is introduced in the
form of a Mixed Binary Integer Linear Program (MBILP).

Consider a directional sensor network with a set of targets, a set of sensors,
and a set of directions. The $i$-th sensor has $W$ directions and an initial
lifetime of $L_i$.

The directions are organized into $K$ cover sets. The $k$-th cover set is
denoted by $D_k$, with work time $t_k$. A direction $d_{ij}$ can belong to
multiple cover sets.

Let us define a Boolean variable $x_{ijk}$ as


$$x_{ijk} = \begin{cases} 1 & \text{if } d_{ij} \in D_k, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$


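The MBILP problem formulated for the MDCS is as follows:


$$\max\; t_1 + t_2 + \dots + t_K \tag{2}$$


subject to


$$\sum_{k=1}^{K} \sum_{j=1}^{W} y_{ijk} \le L_i, \quad \forall s_i \in S, \tag{3}$$

$$y_{ijk} \le t_k + M_0 (1 - x_{ijk}), \tag{5}$$

$$t_k - M_0 (1 - x_{ijk}) \le y_{ijk}, \tag{6}$$

$$\sum_{j=1}^{W} x_{ijk} \le 1, \quad \forall s_i \in S,\ k = 1, \dots, K, \tag{7}$$

$$\sum_{d_{ij} \in D:\ a_m \in d_{ij}} x_{ijk} \ge 1, \quad \forall a_m \in A,\ k = 1, \dots, K, \tag{8}$$

$$x_{ijk} \in \{0, 1\}, \quad t_k \ge 0, \quad y_{ijk} \ge 0, \tag{9}$$

where $M_0$ is a sufficiently large positive number.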

The objective function maximizes the total work time of all the $K$ cover sets.
The constraints (3) enforce the lifetime of every sensor: the total work time
of the $W$ directions of any sensor across all the cover sets cannot exceed the
lifetime of the sensor. The constraints (5) and (6) force $y_{ijk} = t_k$
whenever $x_{ijk} = 1$ and are inactive when $x_{ijk} = 0$. The constraints (7)
express the selection among the directions of a single sensor: at most one
direction of the sensor can work in a cover set. The constraints (8) guarantee
the coverage of each target: in each cover set, each target must be covered by
at least one direction of this cover set. The constraints (9) state the domains
of the variables.
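
The role of the big-M constraints can be checked numerically. The snippet below is only an illustrative Python sketch; the function name `linearization_ok` and the value chosen for `big_m` are assumptions, not part of the model in the paper.

```python
def linearization_ok(x: int, t: float, y: float, big_m: float = 10.0) -> bool:
    """Check y <= t + M0(1 - x), t - M0(1 - x) <= y and y >= 0 for a binary x."""
    upper = y <= t + big_m * (1 - x)
    lower = t - big_m * (1 - x) <= y
    return upper and lower and y >= 0


# With x = 1 only y = t passes; with x = 0 the two bounds are slack, so y = 0 is allowed.
checks = [
    linearization_ok(1, 0.4, 0.4),
    linearization_ok(1, 0.4, 0.1),
    linearization_ok(0, 0.4, 0.0),
]
```

Since the objective pushes the work times up while the lifetime constraint bounds the sum of the y variables, an optimal solution has no incentive to pick a positive y for a direction that is not selected.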

We rewrite the problem (2)-(9) in the form of a BMILP:


$$\min\{-\langle c, x \rangle\} \quad \text{s.t. } x \in X,$$

where $X := \{x \in \mathbb{R}^n : Ax \le b,\ x \ge 0,\ x_j \in \{0, 1\}\ \forall j \in J\}$.


3 DC Programming and DCA 


By using an exact penalty result, we can reformulate the problem (2)-(9) in the
form of a concave minimization program. The exact penalty technique aims at
transforming the original problem into a more tractable equivalent DC program.
Let $K := \{x \in \mathbb{R}^n : Ax \le b,\ x \ge 0,\ x_j \in [0, 1]\ \forall j \in J\}$.
The feasible set of the original problem is then
$X = \{x : x \in K,\ x_j \in \{0, 1\}\ \forall j \in J\}$. The original
program is rewritten as problem (P),

$$\min\{-\langle c, x \rangle : x \in X\}. \quad (P) \tag{10}$$

Let us consider the function $p : \mathbb{R}^n \to \mathbb{R}$ defined by:

$$p(x) = \sum_{j \in J} \min\{x_j, 1 - x_j\}.$$

It is clear that $p$ is concave and finite on $K$, $p(x) \ge 0$ $\forall x \in K$, and that:

$$\{x : x \in X\} = \{x : x \in K,\ p(x) \le 0\}. \tag{11}$$


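Hence, absorbing the minus sign into the cost vector (i.e., replacing $c$ by
$-c$), the problem (P) can be rewritten as:

$$\min\{\langle c, x \rangle : x \in K,\ p(x) \le 0\}. \tag{12}$$

The following theorem can then be formulated.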


Theorem 1. Let $K$ be a nonempty bounded polyhedral convex set, $f$ be a finite
concave function on $K$ and $p$ be a finite nonnegative concave function on $K$.
Then there exists $\eta_0 \ge 0$ such that for $\eta > \eta_0$ the following
problems have the same optimal value and the same solution set:


$$(P_\eta) \quad \alpha(\eta) = \min\{f(y) + \eta p(y) : y \in K\},$$

$$(P) \quad \alpha = \min\{f(y) : y \in K,\ p(y) \le 0\}.$$

Furthermore


— If the vertex set of $K$, denoted by $V(K)$, is contained in
$\{x \in K : p(x) \le 0\}$, then $\eta_0 = 0$.


— If $p(y) > 0$ for some $y$ in $V(K)$, then
$\eta_0 = \min\left\{ \frac{f(y) - \alpha(0)}{S_0} : y \in K,\ p(y) \le 0 \right\}$,
where $S_0 = \min\{p(y) : y \in V(K),\ p(y) > 0\} > 0$.


Proof. The proof for the general case can be found in [7]. 
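
As a quick illustration of the penalty function $p$ used above, the following Python sketch evaluates it on a few points; the function name `penalty` and the index list `binary_idx` (playing the role of $J$) are illustrative assumptions.

```python
def penalty(x, binary_idx):
    """p(x) = sum over j in J of min(x_j, 1 - x_j); zero exactly when x_j is 0 or 1 on J."""
    return sum(min(x[j], 1 - x[j]) for j in binary_idx)


# x_2 is a continuous variable here (index 2 is not in J), so it does not contribute.
p_binary = penalty([0.0, 1.0, 0.3], [0, 1])
p_fractional = penalty([0.5, 1.0, 0.25], [0, 1, 2])
```

On the relaxed set, where the components indexed by $J$ lie in $[0, 1]$, the value is nonnegative and vanishes only at points whose $J$-components are binary, which is what makes the constraint $p(x) \le 0$ equivalent to integrality.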


From Theorem 1 we get, for a sufficiently large number $\eta$ ($\eta > \eta_0$),
the equivalent concave minimization problem:


$$\min\{f_\eta(x) := \langle c, x \rangle + \eta p(x) : x \in K\}, \tag{13}$$

which is a DC program of the form:

$$\min\{g(x) - h(x) : x \in \mathbb{R}^n\}, \tag{14}$$


where $g(x) = \chi_K(x)$ is the indicator function of $K$ and
$h(x) = -f_\eta(x) = -\langle c, x \rangle - \eta p(x)$.

We have thus transformed an optimization problem with integer variables into
an equivalent problem with continuous variables. Notice that (14) is a
polyhedral DC program where $g$ is a polyhedral convex function (i.e., the
pointwise supremum of a finite collection of affine functions). DCA applied to
the DC program (14) consists of computing, at each iteration $k$, the two
sequences $\{x^k\}$ and $\{y^k\}$ such that $y^k \in \partial h(x^k)$ and
$x^{k+1}$ solves the linear program $(P_k)$:


$$\min\{g(x) - \langle x - x^k, y^k \rangle : x \in \mathbb{R}^n\} \iff \min\{-\langle x, y^k \rangle : x \in K\}. \tag{15}$$

From the definition of $h$, and since $-p(x) = \sum_{j \in J} \max\{-x_j, x_j - 1\}$,
a subgradient $y^k \in \partial h(x^k)$ can be computed as follows:


$$y^k_j = \begin{cases} -c_j + \eta & \text{if } x^k_j \ge 1/2, \\ -c_j - \eta & \text{if } x^k_j < 1/2, \end{cases} \quad j \in J, \tag{16}$$

and $y^k_j = -c_j$ for $j \notin J$.


The DCA scheme applied to (14) can be summarized as follows:


Algorithm 1

Initialization:
    Choose an initial point $x^0$, set $k = 0$;
    Let $\epsilon_1, \epsilon_2$ be sufficiently small positive numbers;

Repeat
    Compute $y^k$ via (16);
    Solve the linear program (15) to obtain $x^{k+1}$;
    $k \leftarrow k + 1$;

Until either $\|x^{k+1} - x^k\| \le \epsilon_1 (\|x^k\| + 1)$ or
$|f_\eta(x^{k+1}) - f_\eta(x^k)| \le \epsilon_2 (|f_\eta(x^k)| + 1)$.

Theorem 2. (Convergence properties of Algorithm DCA)


— DCA generates the sequence $\{x^k\}$ contained in $V(K)$ such that the
sequence $\{f_\eta(x^k)\}$ is decreasing.

— The sequence $\{x^k\}$ converges to $x^* \in V(K)$ after a finite number of
iterations.

— The point $x^*$ is a critical point of Problem (14). Moreover, if
$x^*_i \ne 1/2$ for all $i \in J$, then $x^*$ is a local solution to (14).

— For a number $\eta$ sufficiently large, if at iteration $r$ we have
$x^r_j \in \{0, 1\}$ $\forall j \in J$, then $x^k_j \in \{0, 1\}$
$\forall j \in J$ for all $k \ge r$.


Proof. The proof can be found in [7]. 
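
The DCA iteration can be sketched in a few lines of Python under a strong simplifying assumption: the set $K$ is taken to be the unit box $[0,1]^n$ with $J = \{1, \dots, n\}$, so that the linear program at each step has a closed-form solution. In the MDCS model $K$ is a general polytope and an LP solver would be needed; all function names and the numerical data below are hypothetical.

```python
def f_eta(x, c, eta):
    """Penalized objective <c, x> + eta * p(x) with p(x) = sum of min(x_j, 1 - x_j)."""
    return sum(cj * xj for cj, xj in zip(c, x)) + eta * sum(min(xj, 1 - xj) for xj in x)


def subgradient_h(x, c, eta):
    """A subgradient of h(x) = -<c, x> - eta * p(x), derived from the definition of h."""
    return [-cj + eta if xj >= 0.5 else -cj - eta for xj, cj in zip(x, c)]


def solve_lp_box(y):
    """Minimize -<x, y> over the box [0, 1]^n (assumed K): set x_j = 1 where y_j > 0."""
    return [1.0 if yj > 0 else 0.0 for yj in y]


def dca(x0, c, eta, eps=1e-8, max_iter=100):
    x = list(x0)
    for _ in range(max_iter):
        y = subgradient_h(x, c, eta)
        x_new = solve_lp_box(y)
        # Stop when the penalized objective no longer changes significantly.
        if abs(f_eta(x_new, c, eta) - f_eta(x, c, eta)) <= eps * (abs(f_eta(x, c, eta)) + 1):
            return x_new
        x = x_new
    return x


c = [1.0, -2.0, 0.5]
x_start = [0.6, 0.2, 0.4]
x_dca = dca(x_start, c, eta=1.5)
```

In this toy run DCA stops at a binary vertex after two iterations, but the point it returns is only a local solution (the binary minimizer of $\langle c, x \rangle$ here is $(0, 1, 0)$), which is the motivation for adding cuts.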


To improve the solution, we combine DCA with a cutting plane method (see
[8,9] and [10]).


4 DCA-Cut for Global Solution 


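Let us consider the following problem:


$$\min\{\chi_K(z) + c^T x + d^T y + t\,p(x) : z = (x, y) \in \mathbb{R}^n \times \mathbb{R}^q\}, \tag{17}$$

where

$$\chi_K(z) = \begin{cases} 0 & \text{if } z \in K, \\ +\infty & \text{otherwise} \end{cases} \tag{18}$$

is the indicator function of $K$.
We set $g(z) := \chi_K(z)$ and

$$h(z) = -c^T x - d^T y + t(-p)(z) = -c^T x - d^T y + t \sum_{j=1}^{n} \max\{-x_j, x_j - 1\}. \tag{19}$$

Thus, the problem is equivalent to a DC program:

$$\min\{g(z) - h(z) : z = (x, y) \in \mathbb{R}^n \times \mathbb{R}^q\}. \tag{20}$$


We define a valid inequality for all points of $X$ from a solution of the
penalty function $p$ on $K$. Let $z^* \in K$; we define:


$$I_0(z^*) = \{j \in \{1, \dots, n\} : x^*_j \le 1/2\}, \qquad I_1(z^*) = \{1, \dots, n\} \setminus I_0(z^*),$$

and

$$l_{z^*}(z) \equiv l_{z^*}(x) = \sum_{i \in I_0(z^*)} x_i + \sum_{i \in I_1(z^*)} (1 - x_i).$$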


Lemma 1. (see [8]) Let $z^* \in K$, we have


(i) $l_{z^*}(x) \ge p(x)$ $\forall x \in \mathbb{R}^n$.
(ii) $l_{z^*}(x) = p(x)$ if and only if
$(x, y) \in R(z^*) := \{(x, y) \in K : x_i \le 1/2,\ i \in I_0(z^*);\ x_i \ge 1/2,\ i \in I_1(z^*)\}$.

Lemma 2. (see [8]) Let $z^* = (x^*, y^*)$ be a local minimum of function $p$ on
$K$, then the inequality

$$l_{z^*}(x) \ge l_{z^*}(x^*) \tag{21}$$

is valid for all $(x, y) \in K$.


Theorem 3. (see [8,9]) There exists a finite number $t_1 \ge 0$ such that, for
all $t > t_1$, if $z^* = (x^*, y^*) \in V(K) \setminus S$ is a local minimum of
problem (20) then


$$l_{z^*}(x) \ge l_{z^*}(x^*), \quad \forall (x, y) \in K. \tag{22}$$
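
The linear function $l_{z^*}$ and the first statement of Lemma 1 can be illustrated with a small Python sketch. Only the binary part $x$ of $z$ is represented, and the function names and sample points are assumptions made for the example.

```python
def index_sets(x_star):
    """Split indices by x*_j <= 1/2 (I_0) and x*_j > 1/2 (I_1)."""
    i0 = [j for j, v in enumerate(x_star) if v <= 0.5]
    i1 = [j for j, v in enumerate(x_star) if v > 0.5]
    return i0, i1


def l_value(x, x_star):
    """l_{z*}(x): sum of x_i over I_0(z*) plus sum of (1 - x_i) over I_1(z*)."""
    i0, i1 = index_sets(x_star)
    return sum(x[i] for i in i0) + sum(1 - x[i] for i in i1)


def penalty(x):
    return sum(min(v, 1 - v) for v in x)


x_star = [0.25, 0.75, 1.0]
same_region = [0.5, 0.5, 1.0]   # stays on the same side of 1/2 as x_star
other_region = [1.0, 0.0, 1.0]  # flips the first two components
```

At points on the same side of 1/2 as $x^*$ in every coordinate, the linear function coincides with $p$; elsewhere it overestimates $p$, which is what makes an inequality of the form $l_{z^*}(x) \ge \text{const}$ usable as a cut.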


Construction of a Cut from an Infeasible Solution

Let $z^*$ be a solution that is not a feasible point of $X$ such that
$l_{z^*}(z) \ge l_{z^*}(z^*)$ $\forall z \in K$. In this case, there exists at
least one index $j_0 \in \{1, \dots, n\}$ such that $x^*_{j_0}$ is not binary.
We consider the following two cases:


Case 1: The value of $l_{z^*}(z^*)$ is not an integer.
As $l_{z^*}(z)$ is an integer for all $z \in S$, we have immediately:


$$l_{z^*}(z) \ge \rho := \lfloor l_{z^*}(z^*) \rfloor + 1,\ \forall z \in S, \quad \text{and} \quad l_{z^*}(z^*) < \rho. \tag{23}$$

In other words, the inequality

$$l_{z^*}(z) \ge \rho \tag{24}$$

is a cut that strictly separates $z^*$ from $S$.


Case 2: The value of $l_{z^*}(z^*)$ is an integer.
It is possible that there are feasible points $z'$ such that
$l_{z^*}(z') = l_{z^*}(z^*)$. If such a point exists, we could update the best
solution (PLMO01) and also improve the upper bound of the optimal value.

Otherwise, for all $z \in S$, we have $l_{z^*}(z) > l_{z^*}(z^*)$. That is to say,


$$l_{z^*}(z) \ge l_{z^*}(z^*) + 1 \tag{25}$$


is a cut that separates $z^*$ from $S$.
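
The right-hand side of the cut in the two cases can be computed as in the following Python sketch. The helper name `cut_rhs` and the tolerance used to decide integrality are assumptions; in Case 2 the cut is valid only when no feasible point attains the value $l_{z^*}(z^*)$, which the sketch does not check.

```python
import math


def cut_rhs(l_at_z_star: float, tol: float = 1e-9) -> int:
    """Right-hand side rho of the cut l_{z*}(z) >= rho.

    Case 1 (non-integer value): rho = floor(l) + 1.
    Case 2 (integer value):     rho = l + 1, assuming no feasible point attains l.
    """
    nearest = round(l_at_z_star)
    if abs(l_at_z_star - nearest) <= tol:
        return nearest + 1
    return math.floor(l_at_z_star) + 1


rhs_values = [cut_rhs(2.4), cut_rhs(2.0), cut_rhs(0.5)]
```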

We consider below a procedure (called Procedure P) that provides a cut, a
feasible point, or a potential point.

Let us set $I_F(z^*) := \{i \in I : x^*_i \notin \{0, 1\}\}$.

Let us set $I_B(z^*) := \{i \in I : x^*_i \in \{0, 1\}\}$, then
$I = I_F(z^*) \cup I_B(z^*)$ and $I_F(z^*) \cap I_B(z^*) = \emptyset$.

We observe that if we can find $z^1$ such that $l_{z^*}(x^1) = l_{z^*}(x^*)$ and
there exists $i_s \in I_F$ with $x^1_{i_s} = 1 - x^*_{i_s}$ (there are only two
possibilities, $x^1_{i_s} = 1 - x^*_{i_s}$ or $x^1_{i_s} = x^*_{i_s}$), then
$z^1 \notin R(z^*)$.

By Lemma 1 we have:

$$p(x^*) = l_{z^*}(x^*) = l_{z^*}(x^1) > p(x^1).$$


Step 1. Let us set $K_1 = \{z = (x, y) \in K : x_i = x^*_i,\ \forall i \in I_B(z^*)\}$.
Step 2. Choose $i_s \in I_F(z^*)$.
Step 2.1. If $i_s \in I_0(z^*)$ then we solve the linear program:

$$(Pmax1) \quad \bar{x}_{i_s} = \max\{x_{i_s} : z = (x, y) \in K_1,\ l_{z^*}(x) = l_{z^*}(x^*)\}. \tag{26}$$

— If $\bar{x}_{i_s} = 1$ then $\bar{z} \notin R(z^*)$ and by Lemma 1 we have:

$$p(z^*) = l_{z^*}(z^*) = l_{z^*}(\bar{z}) > p(\bar{z}).$$

— If $\bar{x}_{i_s} < 1$ then $x_{i_s} = 0$ and we update the index sets
$I_B(z^*) = I_B(z^*) \cup \{i_s\}$, $I_F(z^*) = I_F(z^*) \setminus \{i_s\}$ and

$$K_1 = K_1 \cap \{z = (x, y) \in K : x_{i_s} = 0\}.$$

— If Problem (Pmax1) is infeasible then we add the cut
$l_{z^*}(x) \ge l_{z^*}(x^*) + 1$ to our problem.


Step 2.2. If $i_s \in I_1(z^*)$ then we solve the linear program:

$$(Pmin2) \quad \bar{x}_{i_s} = \min\{x_{i_s} : z = (x, y) \in K_1,\ l_{z^*}(x) = l_{z^*}(x^*)\}. \tag{27}$$

— If $\bar{x}_{i_s} = 0$ then $\bar{z} \notin R(z^*)$ and by Lemma 1 we have:

$$p(z^*) = l_{z^*}(z^*) = l_{z^*}(\bar{z}) > p(\bar{z}).$$

— If $\bar{x}_{i_s} > 0$ then $x_{i_s} = 1$ and we update the index sets
$I_B(z^*) = I_B(z^*) \cup \{i_s\}$, $I_F(z^*) = I_F(z^*) \setminus \{i_s\}$ and

$$K_1 = K_1 \cap \{z = (x, y) \in K : x_{i_s} = 1\}.$$

— If Problem (Pmin2) is infeasible then we add the cut
$l_{z^*}(x) \ge l_{z^*}(x^*) + 1$ to our problem.


From a feasible point $x^*$, we can add the cut $l_{x^*}(x) \ge 1$ (see [9]) and
restart DCA on the new feasible region. The numerical results show the
efficiency of our proposed method.

5 Numerical Simulation and Conclusion


We evaluate the performance of our approach through simulations running on a
computer with a Core i7-8750 2.2 GHz CPU and 24 GB of memory. The optimization
algorithm is implemented in Matlab and compared with CPLEX 12.7. The data is
generated following [1]. $N$ sensors with sensing radius $r = 100$ m and $M$
targets are deployed uniformly in a region of $400 \times 400$ m². Each sensor
has $W$ directions. The maximal number of cover sets is equal to the number of
sensors, i.e., $K = N$. The initial lifetime of each sensor is set to 1 and $W$
is set to 3. Gap1 and Gap2 denote the gaps between the result of CPLEX 12.7
and the results of DCA and DCA-CUT, respectively (Table 1).

In the table of results, Data, DCA, T-DCA, Gap1, DCACUT, T-CUT, Cplex,
T-Cplex and Gap2 stand for the name of the data set, the objective value of
DCA, the running time of DCA, the gap between DCA and Cplex, the objective
value of DCACUT, the running time of DCACUT, the objective value of Cplex 12.7
(the global optimal value), the running time of Cplex, and the gap between
DCACUT and Cplex, respectively, where


$$\text{Gap} = \frac{\text{Upper bound} - \text{Lower bound}}{\text{Upper bound}}.$$
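
As a small worked example of the gap formula, the Python sketch below expresses the gap in percent, taking the CPLEX value as the upper bound and the DCA or DCA-CUT value as the lower bound; the function name is an assumption.

```python
def gap_percent(upper_bound: float, lower_bound: float) -> float:
    """Relative gap (upper - lower) / upper, expressed in percent."""
    return 100.0 * (upper_bound - lower_bound) / upper_bound


# Objective values 79 (upper bound) and 78 (lower bound) give a gap of about 1.3%.
example_gap = round(gap_percent(79, 78), 1)
```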


Table 1. Results of DCA, DCACUT and CPLEX 12.7.


Size            | Data   | DCA | T-DCA(s) | Gap1(%) | DCACUT | T-CUT(s)    | Cplex | T-Cplex(s) | Gap2(%)
M = 10, N = 80  | Data1  | -   | -        | -       | 77     | 0.6         | 77    | 5.8        | 0
                | Data2  | 78  | 0.36     | 1.3     | 78     | 0.5         | 79    | 4.9        | 1.3
                | Data3  | 77  | 0.4      | 0       | 77     | 1.6         | 77    | 4.1        | 0
                | Data4  | 78  | 0.48     | 1.3     | 78     | 1.0         | 79    | 6.3        | 1.3
                | Data5  | -   | -        | -       | 77     | 0.4         | 78    | 6.1        | 1.3
M = 10, N = 90  | Data6  | 87  | 0.5      | 2.2     | 87     | 1.5         | 89    | 2.7        | 2.2
                | Data7  | -   | -        | -       | 86     | 0.6         | 87    | 2.5        | 1.1
                | Data8  | 88  | 0.6      | 1.1     | 88     | 1.13        | 89    | 9.1        | 1.1
                | Data9  | 87  | 0.6      | 2.2     | 88     | 1.2         | 89    | 9.3        | 1.1
                | Data10 | 89  | 0.3      | 0       | 89     | 9.5         | 89    | 9.7        | 0
                | Data11 | 88  | 0.5      | 0       | 88     | [illegible] | 88    | 12.9       | 0
                | Data12 | 87  | 0.4      | 2.2     | 88     | 1.1         | 89    | 9.8        | 1.1
M = 15, N = 100 | Data13 | 97  | 0.5      | 1.0     | 98     | 0.9         | 98    | 16.8       | 0
                | Data14 | 97  | 0.6      | 0       | 97     | 1.4         | 97    | 14         | 0
                | Data15 | -   | -        | -       | 96     | 17.6        | 96    | 15.6       | 0
                | Data16 | -   | -        | -       | 98     | 0.7         | 98    | 13.7       | 0
                | Data17 | 97  | 0.8      | 1.0     | 98     | 1.5         | 98    | 6.2        | 0
M = 20, N = 150 | Data18 | -   | -        | -       | 147    | [illegible] | 148   | 8.2        | 0.6
                | Data19 | -   | -        | -       | 145    | 2.9         | 145   | 56.2       | 0
                | Data20 | 147 | 2.1      | 0       | 147    | 3.5         | 147   | 54.4       | 0

From the numerical results, we observe that:


— In many cases DCA provides an integer solution after a small number of
iterations.

— DCA-CUT always provides an integer solution.

— The gaps are small, which means that the objective value obtained by
DCA-CUT is close to the global optimal value; almost all gaps are below 2.0%.


In this paper, we introduced a new model of the MDCS problem in the form of
a BMILP problem. An efficient approach based on the DC algorithm (DCA) and a
cutting plane method is proposed for solving this problem. The computational
results show that this approach is efficient and original, as it can give
integer solutions while working in a continuous domain. Given this promising
outcome, in future work we plan to combine DCA, Branch-and-Bound and cutting
plane methods to globally solve the general MDCS problem with high dimension
and real data.


Abstract. In this paper, we present an approach for combining the
CMAES-APOP with a local search in order to make a hybrid evolutionary
algorithm. This combination is based on the information of population size
in the evolution process of the CMAES-APOP algorithm, while the local search
is the quasi-Newton line search algorithm. We give some conditions to
efficiently activate the local search inside CMAES-APOP. Numerical
experiments on multi-modal optimization problems show the efficiency of the
proposed approach.


Keywords: Evolutionary computation - Hybrid evolutionary 
algorithm - Global optimization - CMA-ES - CMAES-APOP 


1 Introduction 


A large number of real-world problems can be considered as multi-modal
optimization problems. Solving multi-modal optimization problems is very
important for decision-making processes. However, finding a global or even a
good local solution is a challenge, since the objective function may have
several global solutions, or it may have too many local solutions. The
CMAES-APOP [11-13] is a variant of the well-known CMA-ES (Covariance Matrix
Adaptation Evolution Strategy) algorithm [8] which adapts the population size
in the CMA-ES to deal with multi-modal functions. This approach is inspired by
a natural desire when solving an optimization problem, as well as a prospect
when using a larger population size to search: we want to see the decrease of
the objective function. In this approach, the non-decrease of the objective
function in a slot of S = 5 successive iterations is tracked to adjust the
population size for the next S successive iterations. This implies that in
each slot of S iterations we change the population size for searching.
Consequently, the variation of the population size takes a staircase form over
the iterations.

In fact, adapting the population size in the CMA-ES seems to be a right
way for solving multi-modal functions, since the default value of the
population size in the CMA-ES, $\lambda := \lfloor 4 + 3\ln(n) \rfloor$, is
known to be insufficient [7]. In the literature there are well-known and
successful strategies for adapting the population size, such as IPOP-CMA-ES
[2], in which the CMA-ES is restarted with the population size increased by a
factor of two whenever one of the stopping criteria is met, and the
BIPOP-CMA-ES strategy [6], in which two restart regimes are defined: one with
large populations (IPOP part), and another one with small populations. In each
restart, BIPOP-CMA-ES selects the restart regime with fewer function
evaluations used so far. Ahrari and Shariat-Panahi [1] introduced a population
size adaptation method for the CMA-ES which uses a measure, the oscillation of
the objective value of $x_{mean}$, to quantify the multi-modality of the region
being explored. This quantity is iteratively updated based on the optimization
history and eventually used to increase the population size when facing highly
multi-modal regions and vice versa. In [15], Nishida and Akimoto have presented
another population size adaptation strategy for the CMA-ES that is based on
the estimation accuracy of the natural gradient.

© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 64-74, 2020.
https://doi.org/10.1007/978-3-030-38364-0_6
A Combination of CMAES-APOP Algorithm and Quasi-Newton Method

In this paper, we introduce a new method for combining the CMAES-APOP
algorithm with a local search to make a hybrid evolutionary algorithm. This
method is based on an important property of CMAES-APOP: it tries to increase
the population size when recognizing the roughness of the objective function in
order to locate the position of an optimal or good local solution; whenever
such a solution is detected, the population size is gradually decreased until
the algorithm converges. The local search used is the quasi-Newton line
search. The motivation for this work is that integrating local search
algorithms into a population-based algorithm sometimes helps to improve its
performance, for continuous/combinatorial optimization problems
[3-5,9,10,14,16].

The rest of the paper is organized as follows. In Sect. 2, we briefly present
the CMAES-APOP algorithm and some motivations for the proposed method. The
method of combining a local search algorithm with the CMAES-APOP is given
in Sect. 3. Numerical experiments are reported in Sect. 4, while some
conclusions and perspectives are discussed in Sect. 5.


2 The CMAES-APOP Algorithm 


The CMAES-APOP [13] is a variant of the CMA-ES which adapts the population
size in the CMA-ES to deal with multi-modal functions. In the following, we
give a short presentation of the CMAES-APOP (for more details, see [11,13]) in
Algorithm 1. The following notation is used:


— iter: the number of iterations.

— $S$: the number of iterations in each slot.

— $f_{med} := \mathrm{median}(f(x_{i:\lambda}),\ i = 1, \dots, \mu)$: the median
of the objective function values of the $\mu$ elite solutions in each
iteration; $f_{med}^{prev}$ and $f_{med}^{cur}$ denote the medians in the
previous and current iteration, respectively.

— $n_{up}$: the number of times "$f_{med}^{cur} - f_{med}^{prev} > 0$" occurs
during a slot of $S$ iterations.

— $t_{up}$: the recorded history of $n_{up}$ in each slot.

— $no_{up}$: the number of most recent slots in which we do not see the
non-decrease.

Lines 7-14 in Algorithm 1 gather the information of $n_{up}$ during $S = 5$
iterations of the CMA-ES. After each $S$ iterations, lines 21-37 adapt the
population size for the next $S$ iterations based on $n_{up}$ and its history.


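Algorithm 1. CMAES-APOP algorithm

 1: Input: $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$
 2: Initialize: $C = I$, $p_c = 0$, $p_\sigma = 0$, $\lambda = k_n \times \lambda_{default}$
 3: Set: $\mu = \lfloor \lambda/2 \rfloor$, $w_i = \log(\mu + 0.5) - \log i$, $i = 1, \dots, \mu$,
    $\mu_w = 1 / \sum_{i=1}^{\mu} w_i^2$, $c_c = \frac{4}{n+4}$, $c_\sigma = \frac{\mu_w + 2}{n + \mu_w + 3}$,
    $c_1 = \frac{2}{(n+1.3)^2 + \mu_w}$, $c_\mu = \frac{2(\mu_w - 2 + 1/\mu_w)}{(n+2)^2 + \mu_w}$,
    $d_\sigma = 1 + 2\max\left(0, \sqrt{\frac{\mu_w - 1}{n+1}} - 1\right) + c_\sigma$,
    $iter = 0$, $S = 5$, $r_{max} = 30$, $n_{up} = 0$, $t_{up} = [\ ]$.
 4: while not terminate
 5:   $iter = iter + 1$;
 6:   $x_i = m + \sigma y_i$, $y_i \sim N(0, C)$, for $i = 1, \dots, \lambda$
 7:   if $iter = 1$
 8:     $f_{med}^{prev} \leftarrow \mathrm{median}(f(x_{i:\lambda}),\ i = 1, \dots, \mu)$
 9:   else
10:     $f_{med}^{cur} \leftarrow \mathrm{median}(f(x_{i:\lambda}),\ i = 1, \dots, \mu)$
11:     if $f_{med}^{cur} - f_{med}^{prev} > 0$   //Check if $f_{med}$ increases
12:       $n_{up} = n_{up} + 1$;
13:     end
14:   end
15:   $f_{med}^{prev} \leftarrow f_{med}^{cur}$   //Reset $f_{med}^{prev}$
16:   $m \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda} = m + \sigma y_w$, where $y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$
17:   $p_c \leftarrow (1 - c_c) p_c + \mathbf{1}_{[\ldots]} \sqrt{1 - (1 - c_c)^2} \sqrt{\mu_w}\, y_w$
18:   $p_\sigma \leftarrow (1 - c_\sigma) p_\sigma + \sqrt{1 - (1 - c_\sigma)^2} \sqrt{\mu_w}\, C^{-1/2} y_w$
19:   $C \leftarrow (1 - c_1 - c_\mu) C + c_1 p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i y_{i:\lambda} y_{i:\lambda}^T$
20:   $\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|N(0, I)\|} - 1\right)\right)$
21:   if (mod($iter$, $S$) = 1) & ($iter > 1$)   //Adapting the population size
22:     $t_{up} = [t_{up}; n_{up}]$;   //History of $n_{up}$
23:     if $n_{up} > 1$
24:       $\lambda \leftarrow \left\lceil \min\left(\exp\left(\frac{n_{up} \cdot (4 + 3\log(n))}{S \cdot \sqrt{\lambda - \lambda_{default} + 1}}\right), \ldots\right) \times \lambda \right\rceil$;
25:       $\sigma \leftarrow \sigma \times \exp(\ldots)$;   //Enlarge $\sigma$ a little bit
26:     elseif $n_{up} = 0$
27:       $no_{up}$ = length($t_{up}$) − max(find($t_{up} > 0$));
28:       if $\lambda > 2\lambda_{default}$
29:         $\lambda \leftarrow \max\left(\lfloor \lambda \times \exp(-no_{up}/10) \rfloor,\ 2\lambda_{default}\right)$;
30:       end
31:     end
32:     if $\lambda$ is changed   //Only when $n_{up} > 1$ or $n_{up} = 0$
33:       Update $\mu$, $w_{i=1 \dots \mu}$, $\mu_w$ w.r.t. the new population size $\lambda$
34:       Update the parameters $c_c$, $c_\sigma$, $c_1$, $c_\mu$, $d_\sigma$
35:     end
36:     $n_{up} \leftarrow 0$   //Reset $n_{up}$ back to 0
37:   end

The population size is enlarged if $n_{up} > 1$ and reduced if $n_{up} = 0$
(i.e., there is no “going up” during $S$ iterations), while it is unchanged if
$n_{up} = 1$. The threshold of 1 on $n_{up}$ is required to decide on the
increase of the population size because, apart from the roughness of the
objective function, the (random) sampling process may also affect the value of
$n_{up}$. It is worth noting that using the median to measure the non-decrease
signal makes the CMAES-APOP algorithm invariant to scaling and shifting of the
objective function.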
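
The decision rule and the reduction step can be sketched in Python as follows. This is only an illustration under assumptions: the helper names are hypothetical, the enlargement formula is deliberately left out, and when no slot with $n_{up} > 0$ has been recorded the index of the last such slot is taken to be 0, an edge case the listing does not specify.

```python
import math


def default_popsize(n: int) -> int:
    """Default CMA-ES population size for dimension n: 4 + floor(3 ln n)."""
    return 4 + math.floor(3 * math.log(n))


def decide(n_up: int) -> str:
    """Population size decision at the end of a slot of S iterations."""
    if n_up > 1:
        return "enlarge"
    return "reduce" if n_up == 0 else "keep"


def slots_without_increase(t_up) -> int:
    """no_up = length(t_up) - (1-based index of the last slot with n_up > 0)."""
    last_positive = 0  # assumption: 0 if no slot ever had n_up > 0
    for index, value in enumerate(t_up, start=1):
        if value > 0:
            last_positive = index
    return len(t_up) - last_positive


def reduce_popsize(lam: int, t_up, lam_default: int) -> int:
    """Reduction step: shrink lambda but never below 2 * lambda_default."""
    if lam > 2 * lam_default:
        shrunk = math.floor(lam * math.exp(-slots_without_increase(t_up) / 10))
        return max(shrunk, 2 * lam_default)
    return lam


lam_default = default_popsize(20)
```

The longer the run of slots without any increase of the median, the faster the population shrinks, which produces the gradual decrease described in the text.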


[Figure: two panels showing the population size against the number of iterations.]


Fig. 1. Adapting the population size of CMAES-APOP in 20-D (Left: Rastrigin
function, Right: Scale Rastrigin function).


When testing the performance of CMAES-APOP on some multi-modal functions
with the conditions as in Sect. 4, with the small initial population size
$\lambda = \lambda_{default}$ (i.e., $k_n = 1$), and without an upper bound on
the population size, in the dimensions $n = 10, 20, 40$, we obtain high success
rates (more than 80%) for all tests (see [11] for the details of the
experiment). Moreover, we found an important property of this algorithm: it
tries to increase the population size to locate the position of an optimal or
good local solution; whenever such a solution is detected, the population size
is gradually decreased until the algorithm converges. For instance, Fig. 1
shows the adaptation of the population size, averaged over successful runs of
the CMAES-APOP, in dimension 20.

This property and the high success rate in the experiment suggest that it is
possible to insert local searches during the process of CMAES-APOP, once the
algorithm has located the position of the optimum, to speed it up. In the next
section, we propose an approach to do that.


3 Combining Local Search with the CMAES-APOP 


Using a local search inside a global search algorithm is not new. In the
literature, there are several successful approaches using this technique for
continuous/combinatorial optimization problems [3-5,9,10,14,16]. Nevertheless,
how to use a local search efficiently strongly depends on the characteristics
of the global search algorithm, and we have to balance the
exploration-exploitation trade-off in such frameworks. We normally have to
answer the following questions:

— When/where should a local search be used?

— Which type of local search should be used, and what is the stopping
condition for the local search?

— How many times should the local search be used?


Algorithm 2 describes the combination of CMAES-APOP with a local search
algorithm. It differs from Algorithm 1 in only one point (lines 8-11 in
Algorithm 2): it tries to use a local search when some conditions are
satisfied.


3.1 When/where Should a Local Search Be Used? 


During the evolution process, we expect that whenever the CMAES-APOP detects
the position of an optimal solution, we activate a local search to get a quick
convergence. Therefore, we use a local search if the following conditions are
all satisfied:


— The first condition is based on the is_LS(hist_popsize) function
(Algorithm 3), which can be described as follows: we use the information of
the population size in the last 40 iterations, which corresponds to the array
hist of 8 population sizes (since the population size has a staircase form
with steps of 5 iterations). Then we fit the 8 data entries of hist with a
quadratic function of the form p1 × t^2 + p2 × t + p3. The first condition
holds if and only if p1 < 0, that is, the population size follows a (locally)
concave quadratic form.

— The second condition: the local search is used only after max(4 × n, 40)
iterations, where n is the problem dimension, since we need the population
size to be quite large to spot the location of the optimal solution.

— The third condition: at the current iteration the algorithm generates a
better candidate than the best one recorded so far.


In this work, we limit the number of times the local search is used:
$max_{LS} := 5$.
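
The first condition can be sketched in Python without any toolbox by fitting the quadratic with ordinary least squares. This is a hedged re-implementation for illustration, not the Matlab routine of Algorithm 3: the function names are hypothetical, the normal equations are solved by hand, and returning False when fewer than 41 population sizes have been recorded is an added assumption.

```python
def quadratic_fit(ts, ys):
    """Least-squares fit of p1*t^2 + p2*t + p3; returns [p1, p2, p3]."""
    rows = [[t * t, t, 1.0] for t in ts]
    # Normal equations (A^T A) p = A^T y, stored as an augmented 3x4 matrix.
    m = [[sum(r[a] * r[b] for r in rows) for b in range(3)]
         + [sum(r[a] * y for r, y in zip(rows, ys))] for a in range(3)]
    for col in range(3):  # Gaussian elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    p = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        p[i] = (m[i][3] - sum(m[i][k] * p[k] for k in range(i + 1, 3))) / m[i][i]
    return p


def is_ls(hist_popsize):
    """True if the last 8 slot sizes follow a concave quadratic trend (p1 < 0)."""
    if len(hist_popsize) < 41:
        return False  # assumption: not enough history yet
    hist = hist_popsize[-41:-1:5]  # one entry per slot of 5 iterations, 8 entries
    p1, _p2, _p3 = quadratic_fit(list(range(1, len(hist) + 1)), hist)
    return p1 < 0
```

A concave trend means the population size has risen and started to fall again, which the method reads as a sign that a promising region has been located.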


3.2. The Quasi-Newton Line Search Method 


In this paper, we use the quasi-Newton line search (denoted by qN for short) as 
local search procedure. The qN is chosen since it has a super-linear convergence 
rate. 

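Algorithm 2. CMAES-APOP algorithm using local search

 1: Input: $m \in \mathbb{R}^n$, $\sigma \in \mathbb{R}_+$
 2: Initialize: $C = I$, $p_c = 0$, $p_\sigma = 0$, $\lambda = k_n \times \lambda_{default}$
 3: Set: $\mu = \lfloor \lambda/2 \rfloor$, $w_i = \log(\mu + 0.5) - \log i$, $i = 1, \dots, \mu$,
    $\mu_w = 1 / \sum_{i=1}^{\mu} w_i^2$, $c_c = \frac{4}{n+4}$, $c_\sigma = \frac{\mu_w + 2}{n + \mu_w + 3}$,
    $c_1 = \frac{2}{(n+1.3)^2 + \mu_w}$, $c_\mu = \frac{2(\mu_w - 2 + 1/\mu_w)}{(n+2)^2 + \mu_w}$,
    $d_\sigma = 1 + 2\max\left(0, \sqrt{\frac{\mu_w - 1}{n+1}} - 1\right) + c_\sigma$,
    $iter = 0$, $S = 5$, $r_{max} = 30$, $n_{up} = 0$, $t_{up} = [\ ]$, hist_popsize = $[\ ]$,
    $n_{LS} = 0$, $max_{LS} = 5$.
 4: while not terminate
 5:   $iter = iter + 1$;
 6:   hist_popsize = [hist_popsize; $\lambda$];   //History of the population size
 7:   $x_i = m + \sigma y_i$, $y_i \sim N(0, C)$, for $i = 1, \dots, \lambda$
 8:   if (iter >= max(4 * n, 40)) & ((mod(iter, S) == 1) || (mod(iter, S) == 3)) &
         is_LS(hist_popsize) & ($f(x_{1:\lambda}) < f_{best\text{-}so\text{-}far}$) & ($n_{LS} < max_{LS}$)
 9:     Do a local search with the starting point $x_{1:\lambda}$. If the local search
        finds a better solution $x_{best}^{LS}$ than $x_{1:\lambda}$, then we update
        $x_{1:\lambda} \leftarrow x_{best}^{LS}$. In particular, if the local search finds an
        optimal solution, the whole algorithm is stopped.
10:     $n_{LS} = n_{LS} + 1$;   //$n_{LS}$ is the number of times the local search is used
11:   end
12:   if $iter = 1$
13:     $f_{med}^{prev} \leftarrow \mathrm{median}(f(x_{i:\lambda}),\ i = 1, \dots, \mu)$
14:   else
15:     $f_{med}^{cur} \leftarrow \mathrm{median}(f(x_{i:\lambda}),\ i = 1, \dots, \mu)$
16:     if $f_{med}^{cur} - f_{med}^{prev} > 0$   //Check if $f_{med}$ increases
17:       $n_{up} = n_{up} + 1$;
18:     end
19:   end
20:   $f_{med}^{prev} \leftarrow f_{med}^{cur}$   //Reset $f_{med}^{prev}$
21:   $m \leftarrow \sum_{i=1}^{\mu} w_i x_{i:\lambda} = m + \sigma y_w$, where $y_w = \sum_{i=1}^{\mu} w_i y_{i:\lambda}$
22:   $p_c \leftarrow (1 - c_c) p_c + \mathbf{1}_{[\ldots]} \sqrt{1 - (1 - c_c)^2} \sqrt{\mu_w}\, y_w$
23:   $p_\sigma \leftarrow (1 - c_\sigma) p_\sigma + \sqrt{1 - (1 - c_\sigma)^2} \sqrt{\mu_w}\, C^{-1/2} y_w$
24:   $C \leftarrow (1 - c_1 - c_\mu) C + c_1 p_c p_c^T + c_\mu \sum_{i=1}^{\mu} w_i y_{i:\lambda} y_{i:\lambda}^T$
25:   $\sigma \leftarrow \sigma \times \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{E\|N(0, I)\|} - 1\right)\right)$
26:   if (mod($iter$, $S$) = 1) & ($iter > 1$)
27:     $t_{up} = [t_{up}; n_{up}]$;   //History of $n_{up}$
28:     if $n_{up} > 1$
29:       $\lambda \leftarrow \left\lceil \min\left(\exp\left(\frac{n_{up} \cdot (4 + 3\log(n))}{S \cdot \sqrt{\lambda - \lambda_{default} + 1}}\right), \ldots\right) \times \lambda \right\rceil$;
30:       $\sigma \leftarrow \sigma \times \exp(\ldots)$;   //Enlarge $\sigma$ a little bit
31:     elseif $n_{up} = 0$
32:       $no_{up}$ = length($t_{up}$) − max(find($t_{up} > 0$));
33:       if $\lambda > 2\lambda_{default}$
34:         $\lambda \leftarrow \max\left(\lfloor \lambda \times \exp(-no_{up}/10) \rfloor,\ 2\lambda_{default}\right)$;
35:       end
36:     end
37:     if $\lambda$ is changed   //Only when $n_{up} > 1$ or $n_{up} = 0$
38:       Update $\mu$, $w_{i=1 \dots \mu}$, $\mu_w$ w.r.t. the new population size $\lambda$
39:       Update the parameters $c_c$, $c_\sigma$, $c_1$, $c_\mu$, $d_\sigma$
40:     end
41:     $n_{up} \leftarrow 0$   //Reset $n_{up}$ back to 0
42:   end


Algorithm 3. The is_LS(hist_popsize) function

function y = is_LS(hist_popsize)
    y = 0;
    % Population sizes of the last 8 slots (one entry every 5 iterations)
    hist = hist_popsize((end-40):5:(end-1));
    % Fit 'hist' with a quadratic function: p1*t^2 + p2*t + p3
    popsize_model = fit((1:length(hist))', hist, 'poly2');
    if (popsize_model.p1 < 0)   % popsize_model is a concave quadratic
        y = 1;
    end
end


In optimization, the descent methods are classical techniques for exploring
the neighborhood of the point $x_{best\text{-}so\text{-}far}$. These approaches
start from a point $x_k$, construct a descent direction $d_k$, then find an
approximate solution of the line search problem:

$$\operatorname{minimize}_{\alpha > 0}\ f(x_k + \alpha d_k). \tag{1}$$


When an approximate solution $\alpha^*$ of the problem (1) is found, we obtain
a new point $x_{k+1} = x_k + \alpha^* d_k$. Then, the previous procedure is
repeated from that point. The gradient descent method uses the direction
$d_k = -\nabla f(x_k)$, while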


Newton's method takes $d_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$, where
$\nabla^2 f(x_k)$ is the Hessian matrix of $f$ at $x_k$. Although Newton's
method has a (locally) quadratic convergence rate, computing
$\nabla^2 f(x_k)$ requires a large amount of computation. The quasi-Newton
methods avoid this disadvantage by using the observed behavior of $x$ and
$\nabla f(x)$ to make an approximation to the Hessian matrix. These methods
normally have a super-linear convergence rate. Among many updating techniques,
the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula is thought to be the most
effective one:


$$H_{k+1} = H_k + \frac{q_k q_k^T}{q_k^T s_k} - \frac{H_k s_k s_k^T H_k^T}{s_k^T H_k s_k}, \tag{2}$$

where $T$ refers to the transpose and


$$s_k = x_{k+1} - x_k, \tag{3}$$

$$q_k = \nabla f(x_{k+1}) - \nabla f(x_k). \tag{4}$$


We can choose the starting matrix $H_0$ to be any symmetric positive definite
matrix, for example, the identity matrix. The source code of the quasi-Newton
line search is provided in the Matlab function fminunc. In fact, we use this
Matlab function with the option of the quasi-Newton algorithm and with the
default parameters.
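
The BFGS update can be written in a few lines of plain Python. This is a minimal sketch for illustration, not the code of fminunc; the function name and the sample vectors are assumptions, $H_k$ is assumed symmetric, and the curvature condition $q_k^T s_k > 0$ is assumed to hold.

```python
def bfgs_update(H, s, q):
    """Return H + q q^T / (q^T s) - (H s)(H s)^T / (s^T H s) for a symmetric H."""
    n = len(s)
    qs = sum(qi * si for qi, si in zip(q, s))                    # q^T s, assumed > 0
    Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
    sHs = sum(s[i] * Hs[i] for i in range(n))                    # s^T H s
    return [[H[i][j] + q[i] * q[j] / qs - Hs[i] * Hs[j] / sHs
             for j in range(n)] for i in range(n)]


H0 = [[1.0, 0.0], [0.0, 1.0]]   # identity as starting matrix
s0 = [1.0, 0.0]                 # step x_{k+1} - x_k (assumed values)
q0 = [2.0, 1.0]                 # gradient difference (assumed values)
H1 = bfgs_update(H0, s0, q0)
H1_s0 = [sum(H1[i][j] * s0[j] for j in range(2)) for i in range(2)]
```

The updated matrix satisfies the secant equation $H_{k+1} s_k = q_k$, so it reproduces the observed change of the gradient along the last step without evaluating any second derivatives.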


4 Numerical Experiment 


In this section, we test the performance of proposed algorithm on some test multi- 
modal functions. We used the matlab implementation of CMA-ES!, version 
3.40.beta to make the proposed algorithm. The experiment is tested on a Mac- 
Book Air Intel(R) Core(TM) i5-5250U, CPU 1.60 GHz, RAM 8G. Table 1 sum- 
marizes the unconstrained multi-modal test problems while the initial parame- 
ters for the algorithms are given in Table 2. 

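All considered functions have a large number of local optima, are scalable in the problem dimension, and have a minimal function value of 0. The known


¹ https://www.lri.fr/~hansen/cmaes20091024.m

Table 1. Test functions.

Name | Function
Rastrigin | $f_{\mathrm{Rastrigin}}(x) = 10n + \sum_{i=1}^{n} \left(x_i^2 - 10\cos(2\pi x_i)\right)$
Scale Rastrigin | $f_{\mathrm{RastScale}}(x) = 10n + \sum_{i=1}^{n} \left((10^{\frac{i-1}{n-1}} x_i)^2 - 10\cos(2\pi 10^{\frac{i-1}{n-1}} x_i)\right)$
Schaffer | $f_{\mathrm{Schaffer}}(x) = \sum_{i=1}^{n-1} (x_i^2 + x_{i+1}^2)^{0.25} \left[\sin^2\left(50 (x_i^2 + x_{i+1}^2)^{0.1}\right) + 1\right]$
Griewank | $f_{\mathrm{Griewank}}(x) = \frac{1}{4000}\sum_{i=1}^{n} x_i^2 - \prod_{i=1}^{n} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$
Ackley | $f_{\mathrm{Ackley}}(x) = 20 - 20\exp\left(-0.2\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\right) + e - \exp\left(\frac{1}{n}\sum_{i=1}^{n}\cos(2\pi x_i)\right)$
Bohachevsky | $f_{\mathrm{Bohachevsky}}(x) = \sum_{i=1}^{n-1} \left(x_i^2 + 2x_{i+1}^2 - 0.3\cos(3\pi x_i) - 0.4\cos(4\pi x_{i+1}) + 0.7\right)$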


global minimum is located at $x = 0$. The performance of the proposed algorithm is tested for dimensions $n = 10, 20, 40$. The bound constraints for the Ackley function in $[-32.768, 32.768]^n$ are handled via quadratic penalty terms. That is, $f_{\mathrm{Ackley}}(x) + \sum_{i=1}^{n} \theta(|x_i| - 32.768)\cdot(|x_i| - 32.768)^2$ is minimized, where $\theta(x) = 1$ if $x > 0$ and $\theta(x) = 0$ if $x \le 0$.
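To make the penalty treatment concrete, here is a minimal Python sketch of two of the test functions and of the penalized Ackley objective. It assumes the standard textbook definitions of the Rastrigin and Ackley functions; the function names are illustrative and do not come from the paper's Matlab code.

```python
import math

BOUND = 32.768  # half-width of the Ackley box [-32.768, 32.768]^n


def rastrigin(x):
    """Standard Rastrigin function; global minimum 0 at x = 0."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def ackley(x):
    """Standard Ackley function; global minimum 0 at x = 0."""
    n = len(x)
    mean_sq = sum(xi * xi for xi in x) / n
    mean_cos = sum(math.cos(2 * math.pi * xi) for xi in x) / n
    return 20 - 20 * math.exp(-0.2 * math.sqrt(mean_sq)) + math.e - math.exp(mean_cos)


def penalized_ackley(x):
    """Ackley plus a quadratic penalty on each coordinate outside the box."""
    penalty = sum((abs(xi) - BOUND) ** 2 for xi in x if abs(xi) > BOUND)
    return ackley(x) + penalty
```

Inside the box the penalty vanishes, so the unconstrained optimizer sees the original function there; outside, each violated coordinate adds the square of its distance to the box.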


Table 2. Initial conditions.


Function | Initial point | $\sigma^0$
Rastrigin | $x^0 = (5, \ldots, 5)$ | 2
Scale Rastrigin | $x^0 = (5, \ldots, 5)$ | 22
Schaffer | $x^0 = (55, \ldots, 55)$ | 220
Griewank | $x^0 = (305, \ldots, 305)$ | 100
Ackley | $x^0 = (15, \ldots, 15)$ | 25
Bohachevsky | $x^0 = (8, \ldots, 8)$ | 23


There are 51 runs for each function in each dimension. Each run is stopped 
and regarded as successful if the objective value is smaller than fxtop = 107° 
(fstop = 10~® for the Schaffer function). We need some additional conditions 
for the Schaffer function: TolX = 107°°, TolFun = 107-9, TolHistFun = 107?°. 
We denote by CMAES-APOP-qN the CMAES-APOP algorithm that uses the quasi-Newton (qN) method as a local search. The default parameters of the Matlab function fminunc are used for the qN method. The performance of the algorithms is measured by the average number of function evaluations needed to reach the target.

Table 3 compares the results of CMAES-APOP with those of its modified version that integrates the qN local search. We also compare these algorithms with the well-known IPOP-CMA-ES algorithm [2]. The results of CMAES-APOP and IPOP-CMA-ES in this table are taken directly from [11]. We observe that adding the local search to CMAES-APOP does not always improve performance. For example, on the Schaffer and Ackley functions, the integrated algorithm appears to be inferior in terms of aRT. This is because the vicinity of the optimal solution of these functions is rugged; thus the proposed local search is unable to find the optimal solution even though CMAES-APOP has located its

Table 3. Comparative results of CMAES-APOP, CMAES-APOP-qN and IPOP-CMA-ES (n: problem dimension, SR: success rate, aRT (average Running Time) = number of function evaluations divided by the number of successful trials).

Function | n | CMAES-APOP SR | CMAES-APOP aRT | CMAES-APOP-qN SR | CMAES-APOP-qN aRT | IPOP-CMA-ES aRT
Rastrigin | 10 | 0.86 | 3.317e+04 | 0.90 | 2.328e+04 | 7.471e+04
Rastrigin | 20 | 0.94 | 9.022e+04 | 0.94 | 7.338e+04 | 2.594e+05
Rastrigin | 40 | 1.00 | 2.981e+05 | 1.00 | 2.709e+05 | 7.744e+05
Scale Rastrigin | 10 | 0.80 | 3.527e+04 | 0.94 | 2.204e+04 | 8.498e+04
Scale Rastrigin | 20 | 0.80 | 1.111e+05 | 0.96 | 7.122e+04 | 2.794e+05
Scale Rastrigin | 40 | 0.90 | 3.427e+05 | 0.94 | 2.716e+05 | 9.179e+05
Schaffer | 10 | 0.96 | 3.098e+04 | 0.96 | 3.298e+04 | 4.801e+04
Schaffer | 20 | 0.94 | 8.175e+04 | 0.96 | 9.189e+04 | 1.285e+05
Schaffer | 40 | 0.90 | 2.255e+05 | 0.90 | 2.591e+05 | 3.206e+05
Griewank | 10 | 0.98 | 1.215e+04 | 1.00 | 7.659e+03 | 7.192e+03
Griewank | 20 | 0.98 | 2.479e+04 | 0.98 | 1.752e+04 | 7.170e+03
Griewank | 40 | 1.00 | 5.769e+04 | 1.00 | 3.795e+04 | 1.186e+04
Ackley | 10 | 1.00 | 1.403e+04 | 1.00 | 1.471e+04 | 1.890e+04
Ackley | 20 | 0.98 | 3.105e+04 | 1.00 | 3.252e+04 | 1.249e+05
Ackley | 40 | 0.92 | 7.204e+04 | 0.94 | 7.379e+04 | 1.162e+06
Bohachevsky | 10 | 1.00 | 1.002e+04 | 1.00 | 6.507e+03 | 5.947e+03
Bohachevsky | 20 | 1.00 | 2.397e+04 | 1.00 | 1.639e+04 | 1.813e+04
Bohachevsky | 40 | 0.98 | 5.536e+04 | 1.00 | 3.994e+04 | 4.537e+04


neighborhood. Nevertheless, on the Rastrigin function, CMAES-APOP-qN is significantly faster than CMAES-APOP, by a factor of about 1.25 on average over the three dimensions. On the Scale Rastrigin function the gain is larger, a factor of about 1.47. In addition, CMAES-APOP-qN outperforms CMAES-APOP by factors of about 1.5 and 1.46 on the Griewank and Bohachevsky functions, respectively. Moreover, CMAES-APOP-qN is slightly better than IPOP-CMA-ES on the Bohachevsky function in dimensions 20 and 40.
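The speed-up factors quoted above can be reproduced from the aRT columns of Table 3. The short Python sketch below assumes that "times faster" means the ratio of aRT values (CMAES-APOP over CMAES-APOP-qN), averaged over the three dimensions; the dictionary layout and names are illustrative.

```python
# aRT values copied from Table 3 for n = 10, 20, 40:
# (CMAES-APOP, CMAES-APOP-qN)
art = {
    "Rastrigin": ([3.317e4, 9.022e4, 2.981e5], [2.328e4, 7.338e4, 2.709e5]),
    "Scale Rastrigin": ([3.527e4, 1.111e5, 3.427e5], [2.204e4, 7.122e4, 2.716e5]),
    "Griewank": ([1.215e4, 2.479e4, 5.769e4], [7.659e3, 1.752e4, 3.795e4]),
    "Bohachevsky": ([1.002e4, 2.397e4, 5.536e4], [6.507e3, 1.639e4, 3.994e4]),
}


def mean_speedup(base, variant):
    """Average over dimensions of the ratio aRT(base) / aRT(variant)."""
    return sum(b / v for b, v in zip(base, variant)) / len(base)


speedup = {name: mean_speedup(b, v) for name, (b, v) in art.items()}
for name, value in speedup.items():
    print(f"{name}: {value:.2f}")
```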


5 Conclusion 


In this paper, we have presented an approach for using a local search, namely the quasi-Newton line search method, inside CMAES-APOP. The proposed algorithm is tested on some benchmark multi-modal functions. The numerical results show that this approach can improve the performance of CMAES-APOP in some cases. For multi-modal functions having a regular structure around the optimal solution, using qN as a local search may bring a significant improvement. However, for multi-modal functions having a rugged structure around the optimal solution, using qN as a local search is insufficient and may waste function evaluations. Combining the method with another local search to overcome this drawback will be investigated in the future. We will also test our algorithm on other test problems.
Abstract. In this paper, we provide an exact reformulation of Nonsmooth Constrained optimization Problems (NCP) using the Moreau-Yosida regularization. This reformulation allows the transformation of (NCP) into a sequence of convex programs whose solutions are feasible for (NCP). This sequence of solutions of auxiliary programs converges to a local solution of (NCP). Assuming the Slater constraint qualification and based on an exact penalization, our reformulation combined with a nonconvex proximal bundle method provides a local solution of (NCP). Our bundle method allows a strong update of the level set, may significantly reduce the number of null steps, and gives a new stopping criterion. Finally, numerical simulations are carried out.


Keywords: Proximal algorithm · Bundle method · Nonconvex optimization · Nonsmooth optimization · Reformulation


1 Introduction 


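In this paper, we consider nonsmooth Nonconvex Constrained Problems (NCP), which may be stated as

$$(\mathrm{NCP}) \quad \begin{cases} \min\ f^0(x) \\ \text{s.t. } f^j(x) \le 0, \quad j \in J = \{1, \ldots, m\}, \\ x \in X, \end{cases} \qquad (1)$$

where $m$, $n$ are integers; $f^j : \mathbb{R}^n \to \mathbb{R}$, for $j \in J \cup \{0\}$, are lower-$C^2$; and $X$ is a bounded polyhedron of $\mathbb{R}^n$.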

The classical bundle method is inspired by Kelley's cutting-plane method, proposed in the 1960s [1], and can be applied to both convex and nonconvex optimization problems. For more details, we refer the reader to [2,3].
Two major practical issues in bundle methods are limiting the bundle size and reducing the number of null-step iterations.

We introduce a new reformulation that allows designing a proximal point algorithm for our problem (1). Consider the following assumptions.


Assumption 1. The objective function and constraint functions of (1) are lower-$C^2$ on the same open and bounded set $O$ which contains $X$.


Assumption 1 ensures the existence of $\eta$ such that the functions $f^j + \frac{\eta}{2}|\cdot|^2$, for $j = 0, \ldots, m$, are strictly convex, and thus allows a convex reformulation of (1).


Assumption 2. The initial point is not a local solution for (NCP). 


When the initial point is a local maximum, the proximal point method may fail to find a descent direction. In the nonsmooth nonconvex case, it may not be easy to show that $0 \in \partial f(\bar{x})$. To avoid having to check whether the considered point is a local maximum (i.e., 0 is a subgradient), it is better to start from an infeasible point.
Assumption 2 ensures global convergence (not depending on the initial point). Moreover, since (NCP) is nonconvex, we assume that:


Assumption 3. The feasible set of problem (1) has a non-empty interior. 


From the lower-$C^2$ property, this assumption provides a descent direction from any feasible solution that is not a critical point of (1). Therefore, it ensures the existence of a KKT point of the primal problem (NCP).

The method proposed in this work is a new proximal point method for con- 
strained nonsmooth nonconvex optimization. The new method is combined with 
a bundle method to design a new triple stabilized bundle method. 

This type of problem has been recently studied by Yang et al. in [3] and a 
nonconvex bundle method is proposed. Thanks to an exact penalty function, in 
[3], the problem is transformed into a parametric unconstrained problem whose 
local convexity requires a more stringent special Slater constraint qualification. 
In the nonconvex bundle method, this local convexity is necessary to have a local 
solution. 

In this paper, we present a new reformulation of the problem that guarantees 
convergence and then, takes advantage of the known results (such as those of 
bundle methods) on unconstrained convex problems to solve (NCP). In fact, 
(NCP) will be converted to a parametric unconstrained optimization problem 
that is convex for sufficiently large parameter tactically detected by a nonconvex 
bundle method. 

The paper is organized as follows. Under certain assumptions, a bundle method is described in Sect. 2. In Sect. 3, we provide numerical results for the proposed algorithms applied to a problem in wireless cellular networks. We end the paper with our conclusion and some discussions in Sect. 4.


2 A Nonconvex Bundle Method


This section presents a proximal bundle method for our problem of interest (1). All the results that follow hold under Assumptions 1, 2 and 3 presented in Sect. 1.


2.1 Work Model 


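For a point $\bar{x} \in \mathbb{R}^n$ and the scalars $c > 0$ and $\eta > 0$, let us define, for the optimization problem (CP), the exact penalty function $\psi_{c,\eta,\bar{x}} : \mathbb{R}^n \to \mathbb{R}$ by

$$\psi_{c,\eta,\bar{x}}(x) = f^0(x) + \frac{\eta}{2}|x - \bar{x}|^2 + c\,F_{\eta,\bar{x}}(x)^+ \qquad (2)$$

where $F_{\eta,\bar{x}}(x)^+ := \max\left\{0,\ f^1(x) + \frac{\eta}{2}|x - \bar{x}|^2,\ \ldots,\ f^m(x) + \frac{\eta}{2}|x - \bar{x}|^2\right\}$.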
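For readers who prefer code, the penalty function (2) can be evaluated as in the following Python sketch. It assumes the reading of (2) in which the same proximal term $\frac{\eta}{2}|x - \bar{x}|^2$ is added to the objective and to every constraint; the function name `penalty` and the toy problem are hypothetical and are not taken from the authors' implementation.

```python
def penalty(f0, constraints, x, x_bar, c, eta):
    """Evaluate psi_{c,eta,x_bar}(x) = f0(x) + prox + c * F(x)^+ as read from (2)."""
    # Proximal (convexification) term (eta / 2) * |x - x_bar|^2
    prox = 0.5 * eta * sum((a - b) ** 2 for a, b in zip(x, x_bar))
    # F(x)^+ = max{0, f^1(x) + prox, ..., f^m(x) + prox}
    violation = max([0.0] + [fj(x) + prox for fj in constraints])
    return f0(x) + prox + c * violation


# Toy problem (hypothetical): minimize x subject to x - 1 <= 0.
f0 = lambda x: x[0]
g1 = lambda x: x[0] - 1.0

value_infeasible = penalty(f0, [g1], [2.0], [0.0], c=10.0, eta=1.0)
value_at_center = penalty(f0, [g1], [0.0], [0.0], c=10.0, eta=1.0)
```

At the prox-center the proximal term vanishes, so for a feasible center the penalty function coincides with the objective.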


Remark 1. This penalty function is better than the one given by 


#90) +emax 0, f(x) + B,..., P™(a)} + nS 4) — ap 


which penalizes the descent steps too heavily.


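Suppose we have generated the points $y^i$, $i \in I_k = \{1, \ldots, k\}$, where $x^k = y^k$ is the last one obtained, and that $\psi_{c,\eta,\bar{x}}(y^i)$ and $v^i \in \partial\psi_{c,\eta,\bar{x}}(y^i)$ have already been computed. We consider the following piecewise linear function (or model)

$$\check{\psi}_{c_k,\eta_k,x^k}(x) = \max_{i \in I_k}\left\{\psi_{c_k,\eta_k,x^k}(y^i) + \langle v^i, x - y^i\rangle\right\} \qquad (3)$$

where $c_k$ grows and $\eta_k$ is updated such that $\psi_{c_k,\eta_k,x^k}$ is convex on $\{y^1, \ldots, y^k\}$. This model can be rewritten as:

$$\check{\psi}_{c_k,\eta_k,x^k}(x) = \psi_{c_k,\eta_k,x^k}(x^k) + \max_{i \in I_k}\left\{\psi_{c_k,\eta_k,x^k}(y^i) + \langle v^i, x^k - y^i\rangle - \psi_{c_k,\eta_k,x^k}(x^k) + \langle v^i, x - x^k\rangle\right\} \qquad (4)$$

Thus,

$$\check{\psi}_{c_k,\eta_k,x^k}(x) - \psi_{c_k,\eta_k,x^k}(x^k) = \max_{i \in I_k}\left\{-\alpha_{c_k,\eta_k}(x^k, y^i) + \langle v^i, x - x^k\rangle\right\}, \qquad (5)$$

where $\alpha_{c_k,\eta_k}(x^k, y^i) = \psi_{c_k,\eta_k,x^k}(x^k) - \left[\psi_{c_k,\eta_k,x^k}(y^i) + \langle v^i, x^k - y^i\rangle\right]$ is the linearization error between $y^i$ and $x^k$. Since $\psi_{c_k,\eta_k,x^k}$ is convex on $\{y^1, \ldots, y^k\}$ and $v^i \in \partial\psi_{c_k,\eta_k,x^k}(y^i)$, for $i \in I_k$, we have

$$\psi_{c_k,\eta_k,x^k}(x) \ge \psi_{c_k,\eta_k,x^k}(y^i) + \langle v^i, x - y^i\rangle, \quad \forall x \in \{y^1, \ldots, y^k\}. \qquad (6)$$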


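In particular, for $x = x^k$ in (6), (2.1) gives us

$$\alpha_{c_k,\eta_k}(x^k, y^i) \ge 0 \quad \text{and} \quad -\alpha_{c_k,\eta_k}(x^k, y^i) = \psi_{c_k,\eta_k,x^k}(y^i) + \langle v^i, x^k - y^i\rangle - \psi_{c_k,\eta_k,x^k}(x^k) \le 0. \qquad (7)$$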
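The cutting-plane model and the linearization error can be illustrated with a small Python sketch. It uses a convex one-dimensional stand-in $\psi(x) = x^2$ rather than the penalty function of the paper, and the helper names (`model_value`, `linearization_error`) are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def model_value(x, bundle):
    """Piecewise linear model: max_i { psi(y^i) + <v^i, x - y^i> }."""
    return max(psi_y + dot(v, [a - b for a, b in zip(x, y)])
               for y, psi_y, v in bundle)


def linearization_error(x_k, psi_xk, y, psi_y, v):
    """alpha(x^k, y^i) = psi(x^k) - [psi(y^i) + <v^i, x^k - y^i>]."""
    return psi_xk - (psi_y + dot(v, [a - b for a, b in zip(x_k, y)]))


psi = lambda x: x[0] ** 2              # convex stand-in for the penalty function
points = [[-1.0], [2.0]]               # y^1, y^2 (x^k is the last point)
bundle = [(y, psi(y), [2.0 * y[0]]) for y in points]  # (y^i, psi(y^i), v^i)
x_k = points[-1]

m0 = model_value([0.0], bundle)
alpha_1 = linearization_error(x_k, psi(x_k), *bundle[0])
```

For a convex function every linearization lies below the function, so the model underestimates it and each linearization error is nonnegative; in the nonconvex case this can fail, which is why the convexification parameter matters.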
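Now, we consider the function

$$\varphi_{c_k,\eta_k,x^k}(x) = \max_{i \in I_k}\left\{-\alpha_{c_k,\eta_k}(x^k, y^i) + \langle v^i, x - x^k\rangle\right\} \qquad (8)$$

and at iteration $k$, the next point generated is the solution of the quadratic program $(QP_k)$ defined by

$$(QP_k) \quad \begin{cases} \min\ \varphi_{c_k,\eta_k,x^k}(x) + \frac{\mu_k}{2}\|x - x^k\|^2 \\ \text{s.t. } \varphi_{c_k,\eta_k,x^k}(x) \le l_k, \\ x \in X. \end{cases} \qquad (9)$$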


Let us write more explicitly the terms in (OP,). For 7 € Ix, let 


FE A(maxt ULV), Feary’), sf = {5 Hye)” =° 


$d_i^k = |y^i - x^k|^2/2$ and $\Delta_i^k = y^i - x^k$


For f? and Fy + = max{ f*, f?,-.,f™}, the respective linearization errors e% ; 
and E', ,; between x* and y’ are defined as 


eo, =P (a*)-— Py) —- gi a*-y), ERa= F(e)+—Fily+ (10) 
+ (6Ft'+, 2* — y') 

It is known that if the function $f^0$ is convex on the line segment joining $x^k$ and $y^i$, then $e_{0,i}^k$ is nonnegative and $g^i \in \partial_{e_{0,i}^k} f^0(x^k)$. The functions $f^0$ and $f^j$ are nonconvex, but $f^0 + \frac{\eta_k}{2}|\cdot - x^k|^2$ and $f^j + \frac{\eta_k}{2}|\cdot - x^k|^2$ are locally convex, so the previous relationship can be applied. We point out that all functions involved in our problem of interest (1), i.e., $f^0$ and $f^j$, are regular. Thus, one can check that, for all
ie Ty, oft’ € OF, 2(y')4 We" € O(maxf{fi(y'),... F"(y)}), 8 = 9 +MAF + 
cpol (t* + mA?) € OW, agent (), Cen Nk a, y’) = ef + crE®, tne (1 oF OP yd 

Therefore, we can rewrite $(QP_k)$ as


ming y+ Fille — a*||? 
(OP x) s.t. — (ef + crEE, i) = me (1 + cp0") dk 


(9° + nly’ — *) + cy.6F (t) + m AP), x —y*) Sv, i € In(11) 
ysl, vER, cEXx. (12) 


We have this analogous result of Proposition in [4,5]; with an appropriate choise 
of 1;,. Let’s recall that: 

1. X is a polyhedron of R™, so as in [4] we can determine the real numbers a* 
and (, as a solution to a minimization of a quadratic function on the unit 
simplex of R!7«!+1+ where p is the number of faces of X; 


A Triple Stabilized Bundle Method 79 


2. Weyn,,z becomes convex whenever the convexification parameter 
Nk = = max{ijo, .-.,Im}- 


Let: 


me—{iel : a®>0}, eF= ef + ChE, is e* = oe ake, (13) 


ie rect 
s* = S° ak (g' + mAk + cdf (t' + mAS)) = D> abs, (14) 
weTpct ieIgct 
dé’ = >° abd?, dhric= D> ofikdf, A*= >) of Al, (15) 
ieIpet ieIpet ie Ine 
en =e" + me(d® + Cede pict): (16) 


Define the aggregate piece 
Denstinsio (x) = Wine sie ck Qe") a (s* ale vt, ct fo) Va € R” (17) 
and the aggregate error Z% = We, m,,0h(2") — De tao (x*). Let us set 


k+1) the real progress 


Za = Weg mg wk (x*) _ Weg pak (y 
= Denne a(L) _ Vins gly the expected or predicted progress 
(related to the solution) 


el = Dey mek (x*) —I,, the expected progress related to level parameter I; 


Proposition 1. [fm > 1 then, 


a® oF € Oe iaest® Gr) and s*¥ +u* € Wer meat (y***) 
gk = v* e Oe, Pen cigar 2"), where: Ek = ef — in |S" a vf ll? Es 0 


go to 


= a (y**1) + (s* + a’ vt — yr) (18) 
= Weg mor (2") + (s* + 0%, a — 2*) — &, Va € R”. 


Points 2 and 3 are analogous to Proposition 2.3 from [5]. 

In the nonconvex case, the linearization error may be negative. A convex cut can then remove a portion of the feasible set (containing an optimal solution) and can also make the feasible set of the subproblem empty. To correct this, some methods use the absolute value of the error to generate cuts. Here we proceed as in the “classical” proximal bundle method, and make sure to update correctly the
convexification parameter 7x. In [3], the linearization error is é* + nd, here it is 
ef + me (1+c¢,5")d¥ (ef instead of é? because our penalty functions are different). 
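We have:

$$g^i + \eta_k\Delta_i^k + c_k\delta_i^k\,(t^i + \eta_k\Delta_i^k) \in \partial_{e_i^k + \eta_k(1 + c_k\delta_i^k)d_i^k}\,\psi_{c_k,\eta_k,x^k}(x^k) \quad \text{if } e_i^k + \eta_k(1 + c_k\delta_i^k)\,d_i^k \ge 0 \quad \forall i \in I_k. \qquad (19)$$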

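We can ensure that the error is nonnegative by checking that $\eta_k \ge \eta_k^{\min}$, where

$$\eta_k^{\min} = \max_{\substack{i \in I_k,\ d_i^k \neq 0, \\ e_i^k + \eta_k(1 + c_k\delta_i^k)d_i^k \le 0}} \frac{-e_i^k}{(1 + c_k\delta_i^k)\,d_i^k} = \max_{\substack{i \in I_k,\ d_i^k \neq 0, \\ e_i^k + \eta_k(1 + c_k\delta_i^k)d_i^k \le 0}} \frac{-(e_{0,i}^k + c_k E_{+,i}^k)}{(1 + c_k\delta_i^k)\,d_i^k}. \qquad (20)$$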
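The role of the threshold in (20) can be sketched in Python as follows. The sketch assumes one reading of (20), namely the largest value of $-e_i^k / ((1 + c_k\delta_i^k)d_i^k)$ over the bundle elements whose corrected error is currently nonpositive, and one reading of the update rule (24), namely that $\eta_k$ is multiplied by a factor greater than 1 when it falls below the threshold. All names (`eta_min`, `update_eta`, `gamma_eta`) and the numerical values are hypothetical.

```python
def eta_min(errors, dists, deltas, c, eta):
    """Smallest eta making every corrected error e_i + eta*(1 + c*delta_i)*d_i >= 0.

    Only indices with d_i != 0 whose corrected error is currently <= 0 are
    considered. Returning 0.0 when no such index exists is an assumption
    of this sketch.
    """
    candidates = [-e / ((1.0 + c * dl) * d)
                  for e, d, dl in zip(errors, dists, deltas)
                  if d != 0 and e + eta * (1.0 + c * dl) * d <= 0]
    return max(candidates, default=0.0)


def update_eta(eta, gamma_eta, errors, dists, deltas, c):
    """Keep eta if it is large enough, otherwise multiply it by gamma_eta > 1."""
    if eta >= eta_min(errors, dists, deltas, c, eta):
        return eta
    return gamma_eta * eta


# Made-up bundle data: the first element has a negative linearization error.
errors, dists, deltas, c = [-2.0, 1.0], [1.0, 1.0], [0, 1], 2.0
eta = 1.0
history = [eta]
for _ in range(4):
    eta = update_eta(eta, 1.5, errors, dists, deltas, c)
    history.append(eta)
```

With these data the parameter grows from 1.0 to 1.5 and then to 2.25, after which all corrected errors are positive and it stays constant.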


At the k — 1 th iteration, the past information to update the model ~., », cr is 
stored in Bf; U Bq, defined from 


Bf = { fy), Fe any ys g € of(y'), ort! € OF ae (ae) 2 0e x} (21) 


Boy = {at = |ly'—2*|P/2, AP =y'-2* : ie hh. (22) 


Proposition 2. 1, d*** = d* + ||c*t) — o*||?/2 — (AF, at! — 2) 
g. ARt = Ab + gh — htt 
8. ef = a8 + fle) — fa*) — (ot, ah — 2") 
k+1 k+l = 

4. Epp é = EE. i + Ere k + EP, , where: 

Bh, = KE max (Pa), P24} 

— (KF max { f(y’), F™(y')} + (KEE 2 — 9')) 
and KF is given at by (23). 


(23) 


KR 1 if Fy(y*)t <0 and Fysi(y’)t > 0 
+ 10 otherwise, 


2.2 Convergence 


From a certain iteration $k_0$, the convexification parameter $\eta_k$ does not change. We can, therefore, analyze Algorithm 1 as if it deals with a convex optimization problem. The convergence of proximal bundle methods requires that $\{\mu_k\}$ be bounded below (see [4,6]). Through the update rule in step 12 of Algorithm 1, this requirement is met. Let $\bar{\eta}$ be the smallest scalar such that $f^j + \frac{\bar{\eta}}{2}|\cdot - \bar{x}|^2$, for $j \in \{0, 1, \ldots, m\}$, is strictly convex on a bounded set that contains $F((NCP))$ and the sequence of optimal solutions generated by Algorithm 1. Denote by $\lfloor\cdot\rfloor$ the integer part of a scalar.


Proposition 3. From an initial value $\eta_0$ of $\eta$, with $\eta_0 < \bar{\eta}$, no more than

$$\left\lfloor \frac{\log(\bar{\eta}) - \log(\eta_0)}{\log(\Gamma_\eta)} \right\rfloor + 1$$

iterations of convexification should be required (see (24)).
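The bound of Proposition 3 follows from the geometric growth of the convexification parameter: after $t$ increases the parameter equals $\eta_0\Gamma_\eta^t$, and this reaches $\bar{\eta}$ as soon as $t \ge \log(\bar{\eta}/\eta_0)/\log(\Gamma_\eta)$. The Python sketch below compares the bound with a direct simulation; the function names and the numerical values are hypothetical.

```python
import math


def max_convexification_steps(eta0, eta_bar, gamma_eta):
    """Bound of Proposition 3: floor((log eta_bar - log eta0) / log gamma_eta) + 1."""
    return math.floor((math.log(eta_bar) - math.log(eta0)) / math.log(gamma_eta)) + 1


def steps_needed(eta0, eta_bar, gamma_eta):
    """Number of multiplications by gamma_eta until eta >= eta_bar."""
    eta, steps = eta0, 0
    while eta < eta_bar:
        eta *= gamma_eta
        steps += 1
    return steps


bound = max_convexification_steps(1.0, 10.0, 2.0)
needed = steps_needed(1.0, 10.0, 2.0)   # 1 -> 2 -> 4 -> 8 -> 16
```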
The relations: s* + v* € Oe, Wegmy2e(@*) and s*+v* € Othe, my or(y*t!) (#x)
(see statement 1. in Proposition 1) allow getting the convergence result. The limit 
point «* of the sequence x* satisfies 0 € Oey ,.0*(a*) and, in view of (*), 0 € 
OWes np,x* (@*). Since the penalty parameter c;, tends to +00, then a local solution 
to the original problem (1) is obtained. 


Algorithm 1 


iF 


ano» Fk Wwhd 


13. 


14. 


Select mi, my € J0,1[, Ic > 1, I, > 1, the tolerances tola, tolg, tole, tolz, and 
the penalty parameter co. 

Set k = 1, I, = {1}, an initial point z' = y', compute g' € Of°(y'), dit’ € 
OF, «1 (y")+- 

Set dt = 0, At = 0, select 4 and 1, compute f°(z') and F 


1,21(t")* update 
Bf U Bqx (given in (21) and (22)). 


. Set Wie’ = —o0, select e4 > 0 

. Set ee = Wem ,el (2*) — plow 

. If ef < tola and Fy ok (2*)* < tolr, stop (2* is a solution). 
. Set lp = —el. 

. If (QP,) has no solution, set 


BAe" = Veg maak (@") + le, ek = —(1— mui)la, go to step 3. 


. Solve the quadratic subproblem (QP) (11) to get (y*t*,v441) and the multiplier 


Gx associated to level constraint, set \7,¢ pact af =Be+1, ef =—ve41 
k 


(IIs* + "||? = pélly’** —a*|?), Ex = ef — 1/palls® + 08 ||”. 


. Update the convexification parameter, set 


Te+1=Mk if me > me” (given in (20)) (24) 
Nk+1 = Ine otherwise and go to step7. 


. If& < tole, ||s* +v"||? < tol, (s* +v* is an e-subgradient of Wer ny wk at y*** (see 


part 2 of Proposition 1)) and Fy, yx+1 (y***) < tole, stop (y**" is a solution). 


. Set ceti = Iecr. 
11. 
12: 


Compute We, in, xk (y***), set wes = view. 
If Pepe yeti (yet) < te, mp oe (@*) — my ef, set pera = ux (Oe + 1). 
a = ae E41 = min {ek, (1 =. m) (nares ie") ~~ vers) t (25) 


. k+1 k U U 
Otherwise, set x = 2", bk+1 = Mey Net+1 = Nhs Ex4i = Ek 


Set Bfri1 U Bax+i: compute gt? € Of o(y**t), dFttt € OF ccneeere*") 


en, Be djt+!, A®** (given in 1 - 4 of Proposition 2). 


Update the model, set Yo. ngyy2kti(@) 2 max{e, n,,2k (©), Vex .ny,2k (Z)} (see 
(17)), increase k by 1, go to step 3. 
3 Numerical Experiments


The bundle method is implemented in Python [9] using CPLEX 12.5 [8]. The solutions obtained are compared with those obtained with Couenne [7] coupled with Pyomo [10]. Through this implementation, we compare different versions of the algorithm (named proximal-level) according to the rules for updating $\mu_k$ (named proximal) and the rules for $a_k$ and $b_k$ (named level).

For proximal, we have: special (sp), which corresponds to the rule for $\mu_k$ based on the Lagrangian formulation of Hestenes and Powell; natural (na), the rule where $\mu_k$ is multiplied by a constant $\Gamma_\mu > 1$ each time we have a null step; constant (co), where $\mu_k$ remains constant (tightening the bounds $a_k$ and $b_k$ is enough to improve the proximity with the last serious step); classical (cl), where $\mu_k$ is set to the maximum of the dual variables associated with the constraints (except the level constraint); and without (wt), where $\mu_k = 0$.

For level, we have: classical (cl), where the model includes the update rule of $a_k$ and $b_k$; and without (wt), where $a_k$ does not change.

The without-without and constant-without versions are not considered. The classical update of $\mu_k$, used in Algorithm 1, is done when we get a serious step. For the special, natural and constant rules, there is an update when we get a null step. When $\mu_k$ increases (in special and natural), this decreases the descent step from the next iteration on. The 'without-classical' version is a Kelley cutting-plane method that includes the level update.

For the different versions of the bundle method (4, 4 0), a condition will be 
added to the one given in Algorithm 1 in step 8., it’s: bk > —e€x. 

In the case where level = classical and uz, 4 0, we add to the stop condition in 
step 8., the condition: by — ax <e, and by > —ex. 

And when pz = 0 (ie. proximal = without), we consider only the next stop 
condition: b; — az <e, and by > —e,. 

The prozimal-level versions of Algorithm 1, where prozimal 4 without, use the 
same parameters. For versions proximal-level where (tu, = 0), except & update, 
we always consider the same parameters. When pz 4 0, €% becomes 


1 
Ey = ef — —||s* + v*||? >0 (see Proposition 1). 
Hk 


For the case where juz = 0 (ie without-classical) which corresponds to our version 
of the Kelley cut, we take & = bp — ax. 

The parameters are initialized as follows: mf = 0.5, mj = 0.2, co = 1.2 and 
DP, =1.5, m9 =1.5 and I, = 1.2, po = 0.1, Py = 1.5, pmin = 10~, tol, = 107%, 
tolg = 10°, tol = 10~®, tolz = 10~8, & = 10°, ap = —10° (when po # 0, ap = 
Vp, the value of v after the first iteration), b) = —40, ¢. = 107° and & = 1074, 
the initial point z° = [—100,..., —100]. 


3.1 Test Problems 


The problems used for these tests are instances of the channel allocation and user association problem in wireless cellular networks described in [11]. The quadratic reformulation of this problem, denoted by (sinr), is given below, with $n, m, k, q \in \mathbb{N}$. We have:


The nonconvex quadratic constraints are: 


N ={1,...,n},M = {1,..., 


m+1},K ={,..., 


«} and Q = {1,...,q}. 


m 
YigrNo + > PypyigkGijrk = PjrGin, tEN, GE M,kEK. 


J =0,5 AI 


min s Ss Bij 


s.t. 


i=1 j=l 
m 

> ay = 1, 
j=1 

Wijk S Paik; 


» Vsijkhs — Pq(1 — yjr) < Wijk, 


Wijk S s VsigkPs, 
s=1 
bigk S Kbg tj, 
n 
So bist S Kb grey, 
l=1 
Big a Kq(1 


Yt - Kbq( (1 = Liz) \< Sk 
l=1 

we < uae 

l=1 k=1 


— ti) < &igt < Biz, 


VigrNo + 2 Py pVigrGijrh = PirGijr, 


j'=0,5' AG 
:S VsigkVs S Vigk S > VsigkYs+1) 
sEQ seEQ 
DS Vsijk = 1, 


sEQ 


K 

5 Pir = Pu;, 
k=1 

K n 

y Yik SK y Lig, 
k=1 i=1 


tEN 


i€N,GEM,KEK 


tENGECM,RECK 


i€N,GEM,KEK 


ile N,jEM 


iE NEM 


ile N,jEM 


iE NjIEM 


ie NjeM 


tENGEM,RECK 


ie MjeMkek 


i€N,GEM,KEK 


jEM 


jEM 
nig > te jEM
keK 
Stn < nay, jEeM 
keK 
Pin < Pryjr, jJEM,kek 
P; — P}(1 — yjr) < Pin < Pi, JEM,kek 
Lij; Yjk; Vsijk, Uz € {0,1}, sEQieNjeMkek 


O< wijk, ijt, Pir, Biz, Pj, Vases 


41EN, GEMKEK 


All 0-1 integrality constraints, $z \in \{0,1\}$, are reformulated as $z^2 - z = 0$. Instances of (sinr) are defined by the values of $n$, $m$, $k$, $q$. We denote by: $F_0$: the largest value of the constraints at the starting point $x^0$, $F_k$: the largest value at the last obtained point, $v^*$: the optimal value provided by Couenne, $v$: the optimal value provided by our algorithm, $v_0$: the value of the objective function at the starting
point «°, #2": the number of serious steps, % gab: for a variable var indexed by 
I(var), at optimum, var,o, denotes its value for Couenne and varpyng its value 
for the version bund of our algorithm (Table 3). 


Then let: 


- A(y) = { min{varpuna(s), 1 — varpuna(s)} <7, 8 € I(var)} if var is a 0-1 


vector 


~ A(y) = {|varcou(s) — varouna(s)| <7, 8 € I(var)} otherwise 


and 


#A(q) 


Table 1. Description of instances 


and gab,,ax = max A(7). 


Instance (name) | Nb. var. | Nb. var. 0-1 | Nb. cntr. | Nb. quad. cntr. | $F_0$ | $v$
2-2-2-2 | 103 | 39 | 247 | 24 | 10100.0 | 9.7032E-17
3-2-2-2 | 154 | 54 | 386 | 36 | 10100.0 | 1.4612E-16
4-2-2-2 | 211 | 69 | 549 | 48 | 10100.0 | 2.0597E-16
5-2-3-2 | 355 | 117 | 913 | 90 | 10100.0 | 3.1135E-16
6-2-5-5 | 901 | 486 | 1577 | 180 | 10100.0 | 3.1378E-17


3.2 Comments 


As shown in Table 2 (all-without vs all-classical and without-classical) and as expected, the update rule can substantially reduce the number of null steps.

Table 2. Iterations (1000) from Algorithm 1 (all = na, cl, sp, co and level = wt, cl). The all-without and all-classical columns correspond to all ≠ without, i.e. $\mu_k \neq 0$.

Instance (name) | cl-wt | co-wt | na-wt | sp-wt | cl-cl | co-cl | na-cl | sp-cl | wt-cl
2-2-2-2 | 110 | 110 | 110 | 110 | 115 | 115 | 111 | 111 | 2
3-2-2-2 | 110 | 110 | 110 | 110 | 115 | 115 | 111 | 111 | 111
4-2-2-2 | 110 | 110 | 110 | 110 | 115 | 115 | 111 | 111 | 111
5-2-3-2 | 110 | 110 | 110 | 110 | 115 | 115 | 111 | 111 | 111
6-2-5-5 | 110 | 110 | 110 | 110 | 114 | 114 | 111 | 108 | 111


Bundle methods provide a better exploration of the feasible set of the auxiliary problem since they can generate cuts from interior points, unlike cutting-plane methods, which generate cuts only from extreme points. However, in many cases, as shown by the results in Table 3, without-classical is better than the all-level versions (all ≠ without) regarding the number of iterations and the quality of the solutions obtained. The cuts made in without-classical are deeper than those made with all-level (all ≠ without).


Table 3. Results obtained with some instances (all = natural, classical, special, constant and level = without, classical)

Instance name | all-level (all: $\mu_k \neq 0$): $v_0 - v^*$ | $v - v^*$ | $F_k$ | serious steps | without-classical: $v_0 - v^*$ | $v - v^*$ | $F_k$ | serious steps
2-2-2-2 | 600.0 | −9.7032E-17 | 0.2222 | 2 | 600.0 | −9.7032E-17 | 0.0 | 2
3-2-2-2 | 900.0 | −1.4612E-16 | 0.2222 | 2 | 900.0 | −1.4612E-16 | 0.222 | 2
4-2-2-2 | 12000.0 | −2.0597E-16 | 0.2222 | 2 | 12000.0 | −2.0597E-16 | 0.25 | 2
5-2-3-2 | 15000.0 | −3.1135E-16 | 0.2222 | 2 | 15000.0 | −3.1135E-16 | 0.239 | 2
6-2-5-5 | 18000.0 | −3.1378E-17 | 0.2222 | 2 | 18000.0 | −3.1378E-17 | 0.222 | 2


4 Conclusion and Extensions 


We have proposed a convexification scheme that is used in the proximal bundle 
method. A Kelley cutting plane version of this algorithm that includes a level 
update rule appears efficient to solve the considered instances. Several work mod- 
els can be designed from the proposed convexification scheme. The polyhedral 
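model is built from the penalty function

$$\psi_{c_k,\eta_k,x^k}(x) = f^0(x) + \frac{\eta_k}{2}\|x - x^k\|^2 + c_k F_{\eta_k,x^k}(x)^+, \quad \forall x \in \mathbb{R}^n,$$

with $F_{\eta_k,x^k}(x)^+ = \max\left\{0,\ f^1(x) + \frac{\eta_k}{2}\|x - x^k\|^2,\ \ldots,\ f^m(x) + \frac{\eta_k}{2}\|x - x^k\|^2\right\}$, $c_k > 0$ (penalty parameter), $\eta_k > 0$ (convexification parameter). When the next point $x^{k+1}$ is not feasible, we have

$$\psi_{c_k,\eta_k,x^k}(x^{k+1}) = f^0(x^{k+1}) + c_k \max\left\{f^1(x^{k+1}), \ldots, f^m(x^{k+1})\right\} + \underbrace{\frac{\eta_k(1 + c_k)}{2}\|x^{k+1} - x^k\|^2}_{(b)}.$$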


The convexification term (b) proceeds as a proximity control term. Since c, tends 
to become very large, descent step size from x* is more and more penalized. We 
can also consider the case for which we would look for the proximal point of f° at 
x* in aregion “centered” at at}. 

The presence of a penalty term that must take a very large value leads to very 
small descent steps and numerical difficulties. To correct this, combined with our 
new reformulation, the following well-known function can be used 


$$h_x(y) = \max\left\{f^0(y) - f^0(x),\ F(y)\right\}, \quad y \in \mathbb{R}^n,$$

where $F(y) = \max\{0, f^1(y), \ldots, f^m(y)\}$.
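As a minimal illustration, this function can be evaluated as in the Python sketch below. It assumes the reading $h_x(y) = \max\{f^0(y) - f^0(x), F(y)\}$ with $x$ the current reference point; the function name `improvement` and the toy problem are hypothetical.

```python
def improvement(f0, constraints, x, y):
    """h_x(y) = max{ f0(y) - f0(x), F(y) } with F(y) = max{0, f^1(y), ..., f^m(y)}."""
    F = max([0.0] + [fj(y) for fj in constraints])
    return max(f0(y) - f0(x), F)


# Toy problem (hypothetical): objective y^2, constraint y - 1 <= 0.
f0 = lambda y: y[0] ** 2
g1 = lambda y: y[0] - 1.0
x_ref = [2.0]

h_feasible_better = improvement(f0, [g1], x_ref, [0.0])
h_infeasible_worse = improvement(f0, [g1], x_ref, [3.0])
```

Because no penalty parameter appears in this function, it avoids the very large multipliers that cause small descent steps.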
Abstract. Numerous optimal design applications are black-box mixed integer nonlinear optimization problems, in which the objective function and constraints are outputs of a black-box simulator involving mixed continuous and integer (discrete) variables. In this paper, we address an optimal design application for bladed disks of turbo-machines in aircraft. We discuss the formulation of an appropriate distance with respect to discrete variables which can deal with the cyclic symmetry property of the system under study. The necklace concept is introduced to characterize similar blade configurations, and an adapted distance is proposed for the discrete space exploration of a derivative-free optimization method. The results obtained with this method on a simplified industrial application are compared with results of state-of-the-art black-box optimization methods.


Keywords: Mixed integer non-linear programming · Black-box simulation · Derivative free trust region method · Necklace distance


1 Motivation 


Air traffic is one of the most important means of transportation, especially in Europe. It also involves very high fuel costs [1,2], as well as high maintenance and manufacturing costs. Thus, reducing fuel consumption by increasing engine efficiency and reducing maintenance costs by decreasing vibrations are two major concerns of the aviation industry.

There are several ways to optimize costs in aircraft: optimal trajectories, optimal seat arrangement designs, optimal cargo arrangements, etc. In our case, we want to optimize the design of turbo-machines, specifically by maximizing the efficiency of the compressor and minimizing the vibrations.


© Springer Nature Switzerland AG 2020 
In the concrete application proposed by SAFRAN, the optimization variables are of two types: continuous shape parameters $x$ (e.g., thickness, length of blades) and binary variables $y$ associated with each blade, with the value 0 for a reference shape and 1 for the other predefined shape (mistuning shape). Binary variables are used to locate these reference blade shapes on the disk. This parameterization provides the distribution of the two shapes around the turbine disk.

There is a strong symmetry in this problem that should be taken into account. Two bladed disks that differ only by a rotation of the blade pattern around the disk lead to the same simulation outputs; such arrangements are considered the same solution, or redundant solutions, e.g. 001101001101 and 010011010011. Note that the number of redundant solutions increases rapidly with the number of blades (see Table 1).
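The rotational redundancy can be checked numerically. The Python sketch below maps each binary pattern to a canonical representative (its lexicographically smallest rotation) and counts the distinct arrangements, both by brute force and with the classical necklace-counting formula $\frac{1}{N}\sum_{d \mid N}\varphi(d)\,2^{N/d}$. The function names are illustrative and are not taken from the paper.

```python
from itertools import product
from math import gcd


def canonical(pattern):
    """Lexicographically smallest rotation of a binary string."""
    return min(pattern[i:] + pattern[:i] for i in range(len(pattern)))


def count_distinct(n):
    """Brute-force count of binary patterns of length n up to rotation."""
    return len({canonical("".join(bits)) for bits in product("01", repeat=n)})


def euler_phi(d):
    return sum(1 for k in range(1, d + 1) if gcd(k, d) == 1)


def count_necklaces(n):
    """Closed-form count of binary necklaces of length n (Burnside's lemma)."""
    return sum(euler_phi(d) * 2 ** (n // d)
               for d in range(1, n + 1) if n % d == 0) // n


same = canonical("001101001101") == canonical("010011010011")
small_counts = [count_distinct(n) for n in (2, 3, 5, 10, 12)]
```

Both counts agree with the distinct-arrangement values listed in Table 1; the closed form is preferable for larger $N$, where enumeration of all $2^N$ patterns becomes expensive.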

In practice, engineers try to overcome this difficulty by using the Reduced Order Modeling Technique (ROMT), detailed in [6–8, 15]. Briefly, grouping blades by sectors of 2 patterns - 00 or 11 and 10 or 01 - tends to reduce the number of discrete variables from n to n/2.

In general these mixed problems are NP-hard and difficult to solve. For real applications in particular, the difficulty lies in evaluating computationally expensive cost functions: it can take several hours or even days to compute the functions to be optimized. Moreover, the derivatives are often not available. Most algorithms for solving black-box Mixed Integer Nonlinear Problems are based on genetic algorithms, which require a large number of function evaluations that are particularly costly in our application. In the next section, we motivate the use of a derivative-free trust-region method.


Table 1. Number of distinct and total arrangements for a given number of blades

Total number of blades on the disk ($N$) | Number of distinct arrangements ($\approx 2^N/N$) | Total number of arrangements ($2^N$)
2 | 3 | 4
3 | 4 | 8
5 | 8 | 32
10 | 108 | 1024
12 | 352 | 4096
20 | 52488 | 1048576


2 Derivative Free Trust-Region Method 


Among derivative-free optimization (DFO) methods, one distinguishes direct search methods (e.g., directional or Nelder-Mead simplex) and trust-region methods based on simple interpolation or regression models (linear or quadratic). Direct search methods require a large number of simulations, whereas the second type of method generally converges more efficiently to a local solution. Convergence to local minima is proved for the latter methods, but adaptations are required to hope to converge to a global optimum (e.g., a multi-start approach with several initial points).

We focus on surrogate DFO methods [4,10]. These methods aim to explore 
the optimization variable space by replacing the costly-to-evaluate functions by 
response surfaces, with a choice of the points to simulate based on a compromise 
between exploration (points far from points simulated at previous iterations) 
and exploitation of information captured by the response surface (points around 
potential optima). The most popular response surfaces for surrogate optimization
are Radial Basis Functions (RBF) and Gaussian processes [11,14,17].

In addition, the derivative-free trust-region algorithm is based on local quadratic
models defined inside a trust region,

$$m(z) = a + g^T z + \frac{1}{2} z^T H z,$$

whose coefficients are determined as the solution of the minimal Frobenius norm
problem

$$\min_{a,\, g,\, H = H^T} \ \frac{1}{2}\|H\|_F^2 \quad \text{s.t.} \quad m(z_i) = f_i, \quad i = 0, \dots, p,$$

where $Z = (z_0, \dots, z_p)$ is the interpolation set for both continuous and binary
variables $z = (x, y)$, $f_i$ are the associated objective function evaluations and
$\|\cdot\|_F$ is the Frobenius norm. The interpolation set is assumed to be poised;
see details in [4,10].
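As an illustration of the minimal Frobenius norm model, the following sketch computes $a$, $g$ and $H$ by solving the optimality (KKT) system of the problem with NumPy. It assumes a poised interpolation set and treats the binary components of $z$ as plain 0/1 numbers; the function names `min_frobenius_model` and `model_value` are illustrative and are not taken from [4,10].

```python
import numpy as np

def min_frobenius_model(Z, f):
    """Return (a, g, H) for m(z) = a + g.z + 0.5 z.H.z.

    The model interpolates f on the rows of Z and has the smallest
    Frobenius norm of H among all interpolating quadratics.
    """
    Z = np.asarray(Z, dtype=float)
    f = np.asarray(f, dtype=float)
    p1, n = Z.shape                        # p + 1 points in dimension n
    A = 0.5 * (Z @ Z.T) ** 2               # A[i, j] = 0.5 * (z_i . z_j)^2
    P = np.hstack([np.ones((p1, 1)), Z])   # columns multiplying a and g
    K = np.block([[A, P], [P.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([f, np.zeros(n + 1)])
    sol = np.linalg.solve(K, rhs)          # fails if the set is not poised
    lam, a, g = sol[:p1], sol[p1], sol[p1 + 1:]
    H = (Z.T * lam) @ Z                    # H = sum_i lam_i z_i z_i^T
    return a, g, H

def model_value(a, g, H, z):
    """Evaluate the quadratic model at the point z."""
    z = np.asarray(z, dtype=float)
    return a + g @ z + 0.5 * z @ H @ z
```

When the data come from an affine function, the sketch returns $H = 0$, which is the expected behaviour of a minimal-norm Hessian.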

In brief, the idea of the algorithm is to replace the initial problem, which involves
expensive simulations, with a quadratic optimization problem that is simple to
solve. In the first step, we fix the binary variables to the best current solution
and solve the quadratic sub-problem within the trust region with respect to the
continuous variables only. When a better point (a smaller objective function
value) is obtained in this first step, the second step consists in minimizing the
quadratic model with respect to both continuous and binary variables, thus solving
a mixed binary continuous quadratic problem.

The model is built to approximate the function efficiently and to fulfill
the fully-linear or fully-quadratic model properties that ensure the local
convergence of the algorithm. To this end, model improvement steps are performed
in order to optimize the geometry of the interpolation set; see details in [4,9,10].

The introduction of binary (or integer) variables requires an adapted trust
region definition. In [9], the authors introduce an $\ell_\infty$-norm trust region for
continuous variables and a Hamming distance trust region for binary variables,
defined as

$$\sum_{j:\, y_{c,j} = 0} y_j + \sum_{j:\, y_{c,j} = 1} (1 - y_j) \le \Delta_y, \qquad (1)$$

with $\Delta_y$ the size of the trust region for binary variables, centered at $y_c$.

When the algorithm reaches a "local solution" (no further improvement),
an exploration phase is necessary in the binary variable domain. It is performed by
adding a "no-good-cut" constraint to escape from this local solution $y^*$, defined
as

$$\sum_{j:\, y^*_j = 0} y_j + \sum_{j:\, y^*_j = 1} (1 - y_j) \ge \Delta_{y^*}, \qquad (2)$$

where $(y^*, \Delta_{y^*})$ are respectively the local solution and the current radius of the
"sufficiently explored" area around this solution. Note that this approach leads
to a collection of constraints, one for each explored local solution.

In the next section, we propose an adapted distance that will be used for the 
trust region associated with binary variables for our application. 


3 Adapted Distance for Blade Design Application 


To avoid solution redundancy (illustrated in Table 1), engineers use ROMT,
which reduces the optimization problem size but has the disadvantage of removing
a large number of feasible solutions. Safran's application with 12 blades has
352 distinct arrangements but only 28 distinct arrangements for the two
sub-problems of ROMT, which severely limits the explored configuration set, with
a high probability of removing "good" candidates.

Therefore, we attempt to define a new distance which avoids the redundant
solutions without arbitrary removal of configurations. Like the Hamming
distance, which leads to the linear constraints (1), the new distance should lead
to simple constraints.

In order to detect similar blade arrangements, we introduce the concept of
"necklace" [12,13]. In combinatorics, a k-ary necklace of length n is an
equivalence class of n-character strings over an alphabet $\Sigma_k = \{a_1, \dots, a_k\}$ of size
k, taking all rotations as equivalent. It represents a structure with n circularly
connected beads which have k available colors. Our blade design application can
be seen as a 2-color necklace optimization with a fixed number of beads.

The concept of necklace gives an exact formula for the number of
distinct arrangements, which is the number of binary necklaces with n beads:
$\frac{1}{n} \sum_{d \mid n} \varphi(d)\, 2^{n/d}$, where $\varphi$ is Euler's totient function and the summation is taken
over all divisors d of n.
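The counting formula can be checked against Table 1 with a few lines of Python. This is a minimal sketch; the helper names `euler_phi` and `count_necklaces` are illustrative.

```python
from math import gcd

def euler_phi(d):
    """Euler's totient: number of integers in 1..d that are coprime with d."""
    return sum(1 for k in range(1, d + 1) if gcd(k, d) == 1)

def count_necklaces(n, k=2):
    """Number of k-ary necklaces of length n (rotations taken as equivalent)."""
    total = sum(euler_phi(d) * k ** (n // d) for d in range(1, n + 1) if n % d == 0)
    return total // n

if __name__ == "__main__":
    for n in (2, 3, 5, 10, 12, 20):
        print(n, count_necklaces(n), 2 ** n)
```

For n = 12 the function returns 352, as in Table 1. For n = 6 it returns 14, which is consistent with the 28 arrangements quoted for the two ROMT sub-problems if each sub-problem has six binary variables (an assumption made here only for this check).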

There are numerous applications based on the necklace concept, with various
distances, e.g., geometry distance for music, swap distance [22-24], Hamming
distance with shift [16], and necklace alignment distance (NAD) [5].

Following the idea of NAD, we propose a new distance, which we call
the necklace distance in the following,

$$d_{neck}(y, y') = \min_{r} \, d_H(y, \mathrm{Rot}^r(y')), \qquad (3)$$

where $d_H$ denotes the Hamming distance and $\mathrm{Rot}^r(y')$ is the rotation of $y'$ by $r$
positions. It is clear that this distance satisfies

- the non-negativity property: $d_{neck}(y, y') \ge 0$,

- the reflexivity property: $d_{neck}(y, y) = 0$,

- the commutativity property: $d_{neck}(y, y') = d_{neck}(y', y)$,

- the triangle inequality property: $d_{neck}(y, y'') \le d_{neck}(y, y') + d_{neck}(y', y'')$.

Besides these metric properties, $d_{neck}$ has one important property,

$$d_{neck}(y, y') = 0 \iff y \in \mathrm{Rot}(y'), \qquad (4)$$

which ensures that all the necklaces in the set are detected. The definition of the
necklace distance is not linear because of the "min" operator. In the next section, we
propose an adaptation of this distance for the trust region and no-good-cut constraints
used in the trust region algorithm.
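The necklace distance can be illustrated with a short brute-force sketch. Binary vectors are represented as tuples, and the rotation direction is an arbitrary choice since the minimum is taken over all rotations. The function names are illustrative.

```python
from itertools import product

def hamming(y, z):
    """Hamming distance between two binary tuples of equal length."""
    return sum(a != b for a, b in zip(y, z))

def rotate(y, r):
    """Cyclic rotation of the tuple y by r positions."""
    r %= len(y)
    return y[r:] + y[:r]

def necklace_distance(y, z):
    """Smallest Hamming distance between y and any rotation of z."""
    return min(hamming(y, rotate(z, r)) for r in range(len(y)))

def check_metric_properties(n):
    """Brute-force check of symmetry and the triangle inequality on {0,1}^n."""
    pts = list(product((0, 1), repeat=n))
    for a in pts:
        for b in pts:
            assert necklace_distance(a, b) == necklace_distance(b, a)
            for c in pts:
                assert (necklace_distance(a, c)
                        <= necklace_distance(a, b) + necklace_distance(b, c))
    return True
```

For example, `(1, 0, 0, 0)` and `(0, 0, 1, 0)` are at Hamming distance 2 but at necklace distance 0, since one is a rotation of the other.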


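Reformulating the Distance for Trust Region DFO Algorithm

The mixed binary continuous sub-problem associated with the minimization of
the quadratic model is written as

$$
\begin{aligned}
\min_{x,\, y} \quad & m(x, y) \\
\text{s.t.} \quad & \min\big(g_1(y, y_c), \dots, g_n(y, y_c)\big) \le \Delta_y, \\
& \|x - x_c\|_\infty \le \Delta_c, \\
& x \in \mathbb{R}^m,\ y \in \{0,1\}^n,
\end{aligned}
\qquad (5)
$$

with $x_c$ and $y_c$ the current centers of the two trust regions in the continuous and
binary variable spaces, $\Delta_c$ and $\Delta_y$ the sizes of the trust regions for, respectively,
the continuous and binary variables, and $g_i(y, y_c) = d_H(y, \mathrm{Rot}^i(y_c))$, $i = 1, \dots, n$.

To deal with the "min" operator in the constraints, we add a slack variable $t$,
integer, and $y_{n+1}, y_{n+2}, \dots, y_{n+n}$, binaries, and propose an exact reformulation,

$$
\begin{aligned}
\min_{x,\, y,\, t} \ m_\mu(x, y, t) = \min_{x,\, y,\, t} \quad & m(x, y) + \mu t \\
\text{s.t.} \quad & t \ge g_i(y, y_c) - M y_{n+i}, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} y_{n+i} = n - 1, \\
& t \le \Delta_y, \\
& \|x - x_c\|_\infty \le \Delta_c, \\
& x \in \mathbb{R}^m,\ y \in \{0,1\}^{2n},\ t \in \mathbb{Z}_+,
\end{aligned}
\qquad (6)
$$

with a real parameter $\mu > 0$ and an integer parameter $M > n$.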
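The role of the slack variables can be checked by brute force on small instances. The sketch below enumerates the binary slack vectors with exactly n - 1 ones and verifies that the smallest admissible t equals the minimum of the g_i, so that the bound on t reproduces the "min" constraint. It only illustrates the mechanism of the reformulation (it does not solve the quadratic sub-problem), and the function names are illustrative.

```python
from itertools import product

def hamming(y, z):
    """Hamming distance between two binary tuples of equal length."""
    return sum(a != b for a, b in zip(y, z))

def rotations(y):
    """All cyclic rotations of the tuple y."""
    return [y[r:] + y[:r] for r in range(len(y))]

def min_t_reformulated(g, M):
    """Smallest integer t >= 0 with t >= g[i] - M*s[i] for all i,
    over binary slack vectors s with sum(s) == n - 1."""
    n = len(g)
    best = None
    for s in product((0, 1), repeat=n):
        if sum(s) != n - 1:
            continue
        t = max(0, max(gi - M * si for gi, si in zip(g, s)))
        best = t if best is None else min(best, t)
    return best

def check_reformulation(y_c, delta_y):
    """Compare the 'min' constraint with its slack reformulation on {0,1}^n."""
    n = len(y_c)
    M = n + 1                      # any integer M > n works, since g_i <= n
    for y in product((0, 1), repeat=n):
        g = [hamming(y, rot) for rot in rotations(y_c)]
        t = min_t_reformulated(g, M)
        assert t == min(g)
        assert (t <= delta_y) == (min(g) <= delta_y)
    return True
```

Because every g_i is at most n, a relaxed constraint (slack equal to 1) has a negative right-hand side and never binds, which is why M > n is sufficient.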

As explained before, the exploration phase uses "no-good-cut" constraints
(2) to force the exploration of new values for the binary variables. The maximum
number of such constraints is $2^n - 1$. If we applied the same trick as in (6) in the
exploration phase, we would greatly increase the dimension of the problem due to the
additional slack variables, $n + 1$ for each "no-good-cut" constraint. We use instead
the equivalence

$$\min(g_1, g_2, \dots, g_n) \ge \Delta_{y^*} \iff g_i \ge \Delta_{y^*}, \quad i = 1, \dots, n,$$

and thus replace each no-good-cut constraint by at most $n$ linear constraints

$$d_H(y, \mathrm{Rot}^1(y^*)) \ge \Delta_{y^*}, \ \dots, \ d_H(y, \mathrm{Rot}^n(y^*)) \ge \Delta_{y^*}.$$

Note that, in practice, we use $\Delta_{y^*} = 1$.
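As a sketch of how these linear constraints could be generated, the code below uses the identity d_H(y, c) = sum_j (1 - 2 c_j) y_j + sum_j c_j, valid for binary vectors, to write each cut as a coefficient vector and a right-hand side. Duplicate rotations are merged, which is why at most n cuts are produced. The function names and the (coefficients, right-hand side) representation are illustrative choices.

```python
from itertools import product

def nogood_cuts(y_star, delta=1):
    """Linear cuts 'sum_j coef[j] * y[j] >= rhs', one per distinct rotation of y_star."""
    y_star = tuple(y_star)
    n = len(y_star)
    distinct_rotations = {y_star[r:] + y_star[:r] for r in range(n)}
    cuts = []
    for c in sorted(distinct_rotations):
        coef = [1 - 2 * cj for cj in c]   # linear part of d_H(y, c)
        rhs = delta - sum(c)              # constant part moved to the right
        cuts.append((coef, rhs))
    return cuts

def satisfies(y, cuts):
    """True if the binary vector y satisfies every cut."""
    return all(sum(a * yj for a, yj in zip(coef, y)) >= rhs for coef, rhs in cuts)

if __name__ == "__main__":
    cuts = nogood_cuts((1, 1, 0, 0, 0, 0))
    kept = [y for y in product((0, 1), repeat=6) if satisfies(y, cuts)]
    print(len(cuts), len(kept))
```

With a radius of 1, the cuts exclude exactly the rotations of the explored solution and nothing else.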


4 Toy Problem


The problem of determining the best-case (or worst-case) mistuning pattern
and maximizing the forced response vibration amplification by the addition of
a small mistuning to a perfectly cyclical bladed disk is addressed in the literature
[6-8, 15, 19-21]. We use a single degree-of-freedom (DOF) per blade disk model
(see Fig. 1).


Fig. 1. Single DOF per blade disk model [7] 


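The DOF problem is formulated as

$$
\begin{aligned}
\min_{\omega,\, y} \quad & \|A\|_\infty \\
\text{subject to} \quad & \omega \in [\omega_{min}, \omega_{max}], \\
& y_i \in \{0, 1\},
\end{aligned}
\qquad (7)
$$

where $A = T^{-1}F$, with $F$ the forcing vector (see [7]),

$$
T = \begin{pmatrix}
T_0 & -k_c & 0 & \cdots & 0 & -k_c \\
-k_c & T_1 & -k_c & \cdots & 0 & 0 \\
\vdots & & \ddots & & & \vdots \\
-k_c & 0 & 0 & \cdots & -k_c & T_{n-1}
\end{pmatrix},
$$

and $T_i = -m_i\omega^2 + j\omega c + 2k_c + (1 + \delta_i)k_y$. The nomenclature is detailed in [7].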


5 Preliminary Numerical Results 


We present some preliminary results of our trust-region method, called DFOb in
the following, applied to the toy problem and to a simplified application provided
by SAFRAN.

Results for the Toy Problem

In general, optimization researchers seek an algorithm that provides
a smaller objective function value than alternative algorithms. However,
for derivative-free optimization methods dealing with expensive-to-evaluate
simulations, the efficiency of an optimization algorithm is generally measured by
comparing the objective function values obtained for a given budget (a fixed
number) of simulations.

On this toy example, we run our method DFOb 10 times with different
random initial points and with two different distances for the binary variables: the
Hamming distance and the necklace distance proposed in the previous section (Eq. 3).
The new implementation reaches a better point than the Hamming distance
implementation, as shown in Fig. 2. The number of simulations needed to reach the
same objective value, 9.68272e-04, is larger with the Hamming distance (90 simulations)
than with the necklace distance (60 simulations).


[Figure 2 plot: mean best function value ("OF value", scale ×10^-4) versus number of simulations (20 to 120), with one curve for DFOb with the necklace distance and one for DFOb with the Hamming distance.]


Fig. 2. Mean best average objective function obtained with 10 random initial datasets 
with Hamming distance and with the proposed distance. 


Results on SAFRAN Application 

We apply our method DFOb to the two sub-problems of ROMT and compare
the results with the NOMAD [3,18] and RBFOpt [11] optimization methods with a
fixed budget of 100 simulations:


1. ROMT sub-problem 1 with patterns 00-11,
2. ROMT sub-problem 2 with patterns 10-01.


DFOb and RBFOpt share the same initial points (chosen at random by the RBFOpt
method). NOMAD's initial point is chosen as the best point of the initial set
with regard to the objective function value. Some constraints are added in order to
avoid trivial solutions with only one type of pattern (only zero or only one values
in the vector of binaries y). Figure 3 illustrates the efficiency of DFOb coupled with
the necklace distance compared to the Hamming distance for binary variables for
the two sub-problems of ROMT. For the ROMT sub-problem with 00-11 patterns,
DFOb with the necklace distance provides an objective function value of 1.033509
after 57 simulations, while the same algorithm with the Hamming distance does
not reach this value after more than 100 simulations. For the first ROMT
sub-problem (see Fig. 4), the DFOb method coupled with the necklace distance performs
very well compared to RBFOpt and NOMAD. NOMAD and RBFOpt
provide infeasible points during the exploration phase (with small objective
function values), the constraints being handled as soft constraints. In DFOb,
the constraints are taken into account explicitly.

Fig. 3. Comparison of the results obtained with the Hamming and necklace distances for
the two ROMT sub-problems with patterns 00-11 (left) and 10-01 (right) of the Safran
application

For the second ROMT sub-problem (see Fig. 5), DFOb finds good points
rapidly, after fewer than 10 simulations; nevertheless, it does not reach the best
objective function value obtained by NOMAD and RBFOpt within the fixed
budget of simulations (100). With a different initial set, DFOb is able to find
the global solution within 100 simulations. A future study will focus on the
sensitivity of the results of our method to the initial set and to its size.

Fig. 4. Comparison of DFOb, NOMAD and RBFOpt results obtained for the sub-problem
with patterns 00-11 of the Safran application. Square black symbols indicate infeasible
points.

Fig. 5. Comparison of DFOb, NOMAD and RBFOpt results obtained for the sub-problem
with patterns 10-01 of the Safran application.


6 Conclusions 


In this study, we address black-box optimization problems with costly-to-evaluate
objective functions. A trust-region derivative-free optimization method adapted
to mixed binary and continuous variables is presented. In order to improve the
exploration phase of this algorithm for a blade design application, we introduce
a distance in the binary variable domain that takes into account the symmetry of
the problem. Preliminary results illustrate the performance of this new
distance compared to the classical Hamming distance. The comparison of the proposed
method with two state-of-the-art methods, NOMAD and RBFOpt, shows some
encouraging results.

Abstract. The paper considers numerical approaches for solving optimal
control problems with free trajectories at the end of the time interval.
A modification of the conjugate gradient algorithm for studying
the controlled dynamic problem is presented. The proposed technique
has been tested on a set of test optimal control problems. We describe
the results of solving an applied problem of nanophysics. We consider
two cells of a quantum computer based on four tunnel-coupled
semiconductor quantum dots. A multistage series of computations
investigating the dependence of the system on changes in the model parameter
values was carried out and demonstrates the effectiveness
of the proposed approach.


Keywords: Numerical algorithm - Optimal control - Applied problem 


1 Introduction 


Optimization problems, both finite-dimensional and infinite-dimensional, arise
quite often in various scientific and technical fields. The first extremal problems
appeared in the framework of the calculus of variations. The main attention was
paid to the analysis of smooth functionals defined on the whole space
or restricted to a smooth set. The development of computing technologies has led
to the emergence in practice of new problems in which the control varies within
some closed set. A wide class of such problems was investigated in the works of
L.S. Pontryagin et al. [1], who obtained a necessary condition for an extremum
(Pontryagin's maximum principle).

The first optimal control problems are considered to be those in
which one searches for the control law that ensures the minimum transition time.
The problem of time-optimal control (performance problem) was
developed by R. Bellman, N.N. Krasovsky, R.V. Gamkrelidze, M. Athans, and P. Falb,
who presented results on the existence, uniqueness, and basic properties of
a time-optimal control (see, for example, [1-3]).

Over time, new problems appeared, involving functionals of a more complex
structure and nonlinear systems describing the dynamical processes under study.

Today, optimal control problems (OCP) with free trajectories at the end
of the time interval, for a differential system with an objective functional, are a
relevant topic in control theory. Such problems can be considered
as auxiliary problems in the study of more complex problems, for example,
those with terminal and phase constraints. These problems
must then be solved repeatedly within the iterations of numerical methods. In
our opinion, this problem is a starting point for solving practical problems of
various classes.

Many theoretical studies focus only on a local search for an extremum in
the optimal control problem. Nevertheless, it is necessary to study global
optimization problems for dynamic systems within the framework of modern
control theory and its applications. The investigation of non-convex extremal
OCP is an important scientific problem.


2 The Optimal Control Problems with Free Trajectories 
at the End of the Time Interval 


The following problem statement is investigated: the controlled dynamical
process is described by the system of differential equations

$$\dot{x}(t) = f(x, u, t), \quad x(t_0) = x^0, \quad t \in T = [t_0, t_1]. \qquad (1)$$

The initial conditions for the trajectories and the control constraints are given:

$$u \in U = \{u \in \mathbb{R}^r : ul_i \le u_i \le ug_i,\ i = \overline{1, r}\}. \qquad (2)$$

We need to minimize a functional of terminal type,

$$I_0(u) = \varphi_0(x(t_1)). \qquad (3)$$

It is necessary to find the optimal value of the functional (3) obtained using an
admissible control from the convex set $U$. The functions $f(x, u, t)$ and $\varphi_0$ are
continuously differentiable in all arguments.

Problems of other classes can be solved by reducing them to the
statement (1)-(2) with the functional (3): problems that, in addition to the direct
control constraints, also involve terminal and phase constraints, optimization of an
integral functional, optimization of control constants, as well as time
minimization problems and others.


3 Numerical Technologies for Studying OCP


The paper considers a numerical technique for investigating OCP with free
trajectories at the end of the time interval. The idea of a multiple search for the
minimum of the functional is one of the most popular and informative approaches.
The multi-start technique is based on repeated searches for an extremum starting
from various admissible controls. This method allows us to find the global extremum,
construct the reachable set for the controlled system of differential equations, and
also evaluate the regions of attraction of different extremal points [4].

Other effective numerical methods for finding the extremal value of the
objective functional are a modification of Pontryagin's method [1], based on the
nonlocal maximum principle; a method for convexifying the reachable set, based
on solving the extended OCP [5,6]; and methods oriented toward the sequential
search of local extrema in order to find the best solution (for example, the technique
of curvilinear search [8] and an algorithm of tunnel type [9]).

Among the other approaches, it is necessary to mention enumeration
methods for local extrema (for instance, the heavy ball method and Branin's method),
methods of integral representations (Chichinadze's method), and evolutionary
algorithms [12-15]. Among the approaches developed in optimal control
theory, methods for solving the Bellman equation (Krotov-Bellman and
Hamilton-Jacobi-Bellman methods) hold a special place [16,17]; these include, in
particular, the method of characteristics and the semi-discretization method. These
methods either impose strict restrictions on the structure of the problem or are
focused on searching only for a local extremum of the objective functional.

For nonlocal nonlinear OCP with controls of relay type, we construct
stochastic algorithms based on finite-dimensional optimization methods:
genetic algorithms for extremum search, random covering algorithms,
Shepard's algorithms, and others. We implemented techniques for post-optimization
analysis and used specialized visualizations for evaluating the quality
of the obtained results.

As an example, we present one of the implemented gradient-type algorithms
for investigating OCP with a free right-hand end of the trajectory. The
conjugate gradient method is one of the best known and most thoroughly studied
methods. It remains effective for the problems considered and continues to
be one of the best in the class of methods that do not use quadratic memory:

Step 1. Select $u^0(t)$, $t \in T$.

Step 2. Set the update frequency $K$.

At the $k$-th iteration:

Step 3. Calculate the conjugate coefficient

$$\beta^k = \frac{\|\nabla I_0(u^k)\|^2}{\|\nabla I_0(u^{k-1})\|^2} \ \text{ if } k \ne K, \qquad \beta^k = 0 \ \text{ if } k = K.$$

Step 4. Calculate the descent direction $g^k(t) = -\nabla I_0(u^k(t)) + \beta^k g^{k-1}(t)$,
$t \in T$.

Step 5. Find $u^{k+1}(t) = \arg\min \{ I_0(u^k(t) + \alpha g^k(t)),\ 0 \le \alpha < +\infty \}$.

The iteration is complete.


To investigate the presented technique, a collection of optimal control
problems [7] has been developed, which currently includes more than 180 model examples.
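The conjugate gradient steps above can be sketched in a finite-dimensional setting, where the control is replaced by a vector of values (for instance its samples on a time grid). This is a minimal sketch under stated assumptions, not the authors' implementation: the coefficient is taken to be the Fletcher-Reeves ratio, a restart is performed every `restart_every` iterations as one reading of the update frequency K, and the line search of Step 5 is supplied by the caller. All function names are illustrative, and the quadratic test functional is hypothetical.

```python
def dot(a, b):
    """Euclidean inner product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(grad, line_search, u0, restart_every, max_iter=100, tol=1e-10):
    """Conjugate gradient iteration with periodic restart (assumed Fletcher-Reeves)."""
    u = list(u0)
    g_prev, d_prev = None, None
    for k in range(max_iter):
        g = grad(u)
        if dot(g, g) ** 0.5 < tol:
            break
        if d_prev is None or k % restart_every == 0:
            d = [-gi for gi in g]                       # restart: steepest descent
        else:
            beta = dot(g, g) / dot(g_prev, g_prev)      # conjugate coefficient
            d = [-gi + beta * di for gi, di in zip(g, d_prev)]
        alpha = line_search(u, d)                       # step length along d
        u = [ui + alpha * di for ui, di in zip(u, d)]
        g_prev, d_prev = g, d
    return u

def solve_quadratic_demo():
    """Minimize 0.5 u.Qu - b.u, a hypothetical stand-in for the functional I_0."""
    Q = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    matvec = lambda v: [dot(row, v) for row in Q]
    grad = lambda u: [qi - bi for qi, bi in zip(matvec(u), b)]
    # Exact minimizer of the quadratic along direction d.
    exact_step = lambda u, d: -dot(grad(u), d) / dot(d, matvec(d))
    return conjugate_gradient(grad, exact_step, [0.0, 0.0], restart_every=2)
```

In an optimal control setting, the gradient would typically be obtained from an adjoint system and the step length from a one-dimensional search, both of which are left outside this sketch.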


4 The Optimal Control Problem in the System of Four 
Semiconductor Quantum Dots 


The computing technologies developed by the authors were tested on test
OCPs with a free right-hand end. We present the results of solving an applied
problem from the field of nanophysics using the proposed approaches.

We consider two cells of a quantum computer based on four tunnel-coupled
semiconductor quantum dots (QD) in the Ge-Si system. The size and
density of the QD are chosen to ensure sufficient tunnel coupling in the
top layer for the implementation of quantum logical operations. There is no tunnel
coupling in the lower layer. A voltage pulse applied to the metal gates located
above the QD leads to tunneling of electrons from the lower to the top dots. The
information exchange operation (SWAP) is implemented through the movement of
particles to the top layer. The problem consists in finding the optimal shape and
length of the voltage pulse for the realization of the SWAP operation.

The nonstationary Schrödinger equation is employed to describe the controlled
movement of electrons in the system of vertically combined QD. The
mathematical model of the analyzed system is formulated as a system of controlled
differential equations:

$$\dot{x} = A_0(u)\,x, \qquad (4)$$

where the dimension of the matrix $A_0(u)$ is 32 × 32 [10]. The structure of this matrix
is presented in the initial statement [11]. It is necessary to transfer the
system from $x(t_0) = x^0$ to $x(t_1) = x^1$ with minimum expended energy.
The energy required for the movement of the electrons between the different
QD is the control. The objective functional contains two
parts: the control variation over the whole time interval and the deviation of the
system trajectory from the terminal conditions,

$$I(u) = \beta_1 \int_T u^2\,dt + \beta_2 \sum_{j=1}^{32} \big(x_j(t_1) - x_j^1\big)^2 \to \min. \qquad (5)$$

We deal with variants of this problem with different values of the coefficients
$\beta_1$, $\beta_2$ and with the response-speed functional ($I(u) = t_1 \to \min$).

Numerical solutions to the problems of this series are implemented using the 
developed approach. 

The first difficulty in solving this problem numerically is the slow convergence
of the algorithms over the whole time interval. To overcome this difficulty, we use
the well-known method of parameter continuation: the system of differential
equations is supplemented with a new parameter $p$: $\dot{x} = p\, f(x, u, t)$. A numerical
experiment on a small time interval gives a close approximation for carrying out
the next computation. A parameter continuation strategy was found which
makes it possible to pass from solving simple problems to the original one, taking
into account the results from the previous stages.

The most complicated and laborious problem turned out to be the response-speed
problem. The response time was estimated by experts at
1500 units. We were able to improve this estimate: the optimal value
proved to be 606 conventional units. The optimal shape of the control
voltage pulse and the trajectories of the probability of the electron being located in
different QD were found by the developed algorithms for solving OCP. It was found
that the existence of tunnel coupling between QD along the vertical line leads to
additional errors in the implementation of the SWAP operation for electrons in the
lower layer.

To study the quantum logical operation SWAP we use the following
functional [10]:

$$I(u) = \sum_{j=1}^{32} x_j^2(t_1) + 4 - 2\sqrt{D^2 + E^2} \to \min, \qquad (6)$$

where $D$ and $E$ are defined as
$D = x_1 + 0.5((x_{10} + x_{11}) + (x_{14} - x_{15}) + (x_{18} + x_{19}) + (x_{23} - x_{22})) + x_{28}$ and
$E = x_5 + 0.5((x_{11} - x_{10}) + (x_{14} + x_{15}) + (x_{18} - x_{19}) + (x_{22} + x_{23}))$.
The structure of the control is given by $\delta B(t) = A\cos(\omega t + \varphi) + C$,
where $A \in [0, 3]$ is the amplitude of the oscillations, $\omega \in [0, 200]$ is the frequency,
$\varphi \in [0, 2\pi]$ is the initial phase, and $C \in [-3, 3]$ is the constant displacement.

The series of computational experiments for the quantum logical operation
√SWAP made it possible to evaluate its accuracy and reduce its execution
time. The structure of the obtained optimal control is presented in Fig. 1 (on the
left side). The dependence of the error functional on changes
in the constant bias value (parameter $C$) for the operation was analyzed. The
control structure was then made more complex:
$\delta B(t) = A\cos(\omega t + \varphi) \cdot \sin(\pi t / 0.1766) + C$.

[Figure 1 plots: two panels showing the control $\delta B(t)$ versus $t$ for $t$ between 0.0 and 0.3.]


Fig. 1. Optimal control for the quantum logical operation √SWAP.

We selected other intervals for the controlled parameters: $A \in [0, 10]$,
$\omega \in [0, 1000]$, $\varphi \in [0, 2\pi]$, $C \in [-3, 3]$. This made it possible to obtain a control
with a different structure (Fig. 1, on the right side) and increase the accuracy in
the functional (its value decreased from 4.77796 - 107% to 4.23987 - 10-3). The 
optimal parameter values for this control are as follows: $A = 0.100$, $\omega = 887.689$,
$\varphi = 4.985$, $C = -0.036$. The harmonic dependence of the applied magnetic field
$\delta B(t)$ provides the universal set of logical operations necessary for quantum
computing.


5 Conclusion 


Various control structures for the system of quantum dots were investigated,
and the solutions obtained are presented for different cases. The results of
the study of the electron states in the model with two layers of QD show the
possibility of using these states for quantum computing.

Thus, the proposed numerical technology can be used to solve applied OCP
from various scientific and technical fields.

Abstract. This paper investigates the overhaul maintenance scheduling
problem in which the maintenance duration is uncertain at the time of
planning. This problem involves specifying the dates of trains' arrival at
the maintenance centre while taking into consideration the due windows,
the desired number of trains in service, and the capacity of the maintenance
centre. The cycle time of each type of train is random with a
known probability distribution. The objective is to minimise a weighted
sum of two components: (i) the deviation of the assigned arrival dates
from the due windows and (ii) the penalty for violating the resource
constraints. A solution approach combining a genetic algorithm with sample
average approximation is developed to solve this problem. The solution
approach consists of a genetic algorithm for global search and an exact
method to determine the arrival dates of train-sets when a sequence of
train-sets is known. Results on data provided by one of the leading
Australian maintenance centres show that the proposed method can
produce good solutions within acceptable computation time.


Keywords: Genetic algorithm - Stochastic cycle time - Quadratic 
earliness/tardiness - Sample average approximation 


1 Introduction 


This paper deals with the scheduling of overhaul maintenance of trains, arising 
in the realm of passenger rail services. In the case of overhaul maintenance, the 
trains are withdrawn from service and are sent to a specialised maintenance 
centre where they will stay for at least one month for the entire maintenance 
process. A typical objective of this problem is to minimise the total cost of 
earliness and tardiness subject to the centre capacity. According to [8], this 
problem is known as the Overhaul Maintenance Scheduling Problem (OMSP). 
This paper addresses the stochastic version of OMSP, where the dwell time of 
the trains at the maintenance centre is uncertain at the time of planning. We 
refer to this problem as Stochastic Overhaul Maintenance Scheduling Problem 
(SOMSP). 


© Springer Nature Switzerland AG 2020

In SOMSP, we consider the perspective of the maintenance centre's planning
manager, who must determine an arrival plan specifying the arrival dates for all
trains for one year or for a longer period, where each train has a distinct due
window and a random dwell time¹ that follows a known probability distribution
associated with the type of train. During the planning phase, one must take into
consideration the distinct due windows, which are the desired arrival windows of
the trains. Note that the definition of due window in this paper differs
from that in most of the scheduling literature, where it refers to the desired
completion window of a job.

Trains arrive at the maintenance centre in groups. Each group comprises
several cars coupled together and is referred to as a set or a train-set
(see, for example, [4]). All cars in a train-set undergo maintenance at the
maintenance centre simultaneously. A feasible arrival plan is one that satisfies a
number of constraints. Firstly, a team of technicians and engineers with a broad 
range of skills is needed to perform the maintenance tasks. To complete the 
maintenance operations, the team must use various equipment and materials. 
These renewable and nonrenewable resources can be quantified as centre capac- 
ity. The centre capacity imposes a restriction on the number of train-sets which 
can undergo maintenance simultaneously. 

Secondly, on the arrival date, the train-set is completely withdrawn from 
service and must stay at the maintenance centre for at least one month. This 
long cycle time directly impacts the number of train-sets available in active 
service. Therefore, the permissible number of train-sets that can be taken out of
service simultaneously is specified for each type of train-set.

Solving the SOMSP is challenging for various reasons: (i) the single machine 
scheduling problem with earliness and tardiness objectives, which is similar to 
the OMSP, is known to be NP-hard [11]. Hence, the considered SOMSP with 
stochastic dwell time is even harder to solve; and (ii) due to the non-uniform dis- 
tribution of the desired arrival window, there is a trade-off between respecting the 
time window and satisfying the centre capacity and operational requirement [6]. 

Due to the complexity of SOMSP, a Genetic Algorithm (GA) is proposed
to solve the considered problem. Based on a previous study [2], it is noted
that the decoding procedure, where a chromosome is transformed into a feasible
arrival plan, is a crucial step of the GA. The decoding procedure used in [2] is a
simple greedy heuristic which does not generate good solutions. To improve the
performance, we develop in this paper a new decoding procedure based on the
Sample Average Approximation (SAA) method.

The remainder of this paper is organised as follows. Section 2 presents the rel- 
evant literature. Section 3 gives the mathematical formulation of the considered 
problem. Section 4 shows the model formulation using SAA approach. Section 5 
describes the proposed genetic algorithm in detail. Section 6 reports the results 
of the computational experiments. Finally, Sect. 7 concludes the study and gives 
directions for future research. 


¹ In the context of this paper, dwell time and cycle time have the same meaning. The
two terms are used interchangeably.


2 Literature Review


Papers that consider the scheduling and planning of rolling stock’s maintenance 
within the railway industry can be categorised into two groups. The first focuses 
on the running maintenance which is performed during the connection between 
two consecutive paths or overnight at the rolling stock depot. These studies 
generally consider the perspective of the railway provider who must design a 
rolling stock rostering plan while taking into consideration various maintenance 
requirements. The design objectives often include minimising a weighted sum of 
maintenance cost, deadhead cost due to unexpected failure, and substitution cost 
due to the use of undesired train-sets [4]; and minimising the sum of deadhead 
costs and number of train units [9]. 

The second focuses on the overhaul maintenance which is performed at a 
specialised workshop. These studies generally consider the perspective of the 
workshop’s planning manager who must determine the dates on which the trains 
must be sent to the workshop for maintenance. As an early work, [8] proposes
a genetic algorithm for solving the OMSP. Doganay and Bohlin [1] formulate
the OMSP as a mixed integer linear programming model and solve it by an exact
method. Lin et al. [6] formulate the high-level maintenance scheduling for high
speed trains as a mixed integer linear programming model and propose a
simulated annealing algorithm for solving large-scale instances. From these studies,
we can observe that minimising the total cost of earliness and
tardiness is a common design objective of OMSP.

The proposed SOMSP is similar to a Stochastic Multiple Resource Constrained
Scheduling Problem (SMRCSP), in which one needs to decide the start
times of jobs requiring different types of resources under uncertain processing
times [3]. This kind of problem has applications in appointment sequencing and
scheduling, where jobs correspond to operation appointments and resources
correspond to doctors, nurses and operating rooms [7]. However, the proposed SOMSP
differs from the aforementioned problems in two key respects: (i) each
type of train-set has a known processing time on the first operation line, where
preemption is not allowed, so the scheduling on the first operation line
can be treated as a single machine scheduling problem; and (ii) train-sets of
the same type have a natural order, that is, a train-set
with an earlier due window should always arrive for maintenance earlier than the
others of the same type in order to minimise the objective function value.


3 Mathematical Formulation 


Consider a set $N := \{1, \dots, n\}$ of train-sets and a planning period of $T$ days.
The planning horizon is discretised into calendar days which are indexed from 0
to $T-1$.

Each train-set $j \in N$ has a due window $[e_j, l_j]$, where $e_j$ is the earliest desired
arrival date and $l_j$ is the latest desired arrival date. The earliness and tardiness
of train-set $j$ if it arrives on day $t$ are defined as $E_{jt} = \max\{0, e_j - t\}$ and
$T_{jt} = \max\{0, t - l_j\}$, respectively. Let $c_{jt}$ be the cost for assigning train-set $j$ to
arrive on day $t$, as indicated in (1).

$$c_{jt} = \lambda_1 E_{jt}^2 + \lambda_2 T_{jt}^2 \qquad (1)$$

where $\lambda_1$ and $\lambda_2$ are the earliness and tardiness cost factors, respectively.
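As a small illustration of (1), the cost of an arrival day can be computed directly from the due window. The sketch below assumes the quadratic form of the earliness and tardiness terms; the function name and the default cost factors are illustrative only.

```python
def arrival_cost(e_j, l_j, t, lam1=1.0, lam2=1.0):
    """Cost c_jt of sending a train-set with due window [e_j, l_j] on day t."""
    earliness = max(0, e_j - t)   # E_jt: days before the earliest desired date
    tardiness = max(0, t - l_j)   # T_jt: days after the latest desired date
    return lam1 * earliness ** 2 + lam2 * tardiness ** 2
```

For a due window [19, 47], arriving on day 10 is 9 days early and costs 81 with unit cost factors, while any day inside the window costs 0.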

There are $m$ types of train-sets and the set of all train-sets is partitioned into
$m$ families. Each train-set belongs to a train family $F^k$, $k \in K = \{1, \dots, m\}$.
Each type of train-set requires a minimum duration on the first operation line,
during which the maintenance operations must be performed without interruption.
For each train family $F^k$, let $p_k$ be the minimum duration on the first
operation line. The minimum duration is assumed to be deterministic.

For each day $t$, let $C_t$ be the centre capacity (i.e. the number of train-sets
which can undergo maintenance simultaneously). Violation of this restriction is
permitted for a penalty cost. Let $\delta_t$ be the unit penalty for violating the centre
capacity. In practice, the penalty is associated with each additional train-set to
account for overtime, outsourcing, and hiring contractors.

For each train family $F^k$ and each day $t$, let $C_{kt}$ be the permissible number
of out-of-service train-sets. Violation of this restriction is permitted since the
shortage of some types of train-sets can be covered by a different train type.
Let $\delta_{kt}$ be the unit penalty for violating the limit $C_{kt}$. In practice, a penalty is
associated with each substitution to account for the passengers’ dissatisfaction
due to the differences in train configurations.

The cycle time of each train-set $j$ is a random variable $D_j$. Train-sets in a
train family follow the same probability distribution. It is assumed that the
probability distribution for the cycle time of each type of train-set is known. It
is further assumed that all $D_j$ are independent.

We introduce the time-indexed binary variables $x_{jt} \in \{0,1\}$, which are equal
to 1 if train-set $j \in N$ arrives at the maintenance centre on day $t$, and equal
to 0 otherwise. Then, for each train-set $j \in N$, its arrival day is defined as

$$s_j = \sum_{t=0}^{T-1} t\, x_{jt}, \quad \forall j \in N \qquad (2)$$
Accordingly, we can define an arrival plan $s = (s_1, \dots, s_n)$. For any arrival
plan $s$ and any integers $1 \le k \le m$ and $0 \le t < T$, the number of trains of
family $k$ that are present at the centre on day $t$ is

$$Z^s_{kt} = \sum_{j \in F^k} B(s_j, t), \qquad (3)$$

where

$$B(s_j, t) = \begin{cases} 1 & \text{if } s_j \le t \text{ and } s_j + D_j \ge t + 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Then, the total number of trains present at the centre on day $t$ is

$$Z^s_t = \sum_{k \in K} Z^s_{kt}. \qquad (5)$$

If it is clear which arrival plan is considered, the superscript $s$ can be dropped
and the notation $Z_t$ ($Z_{kt}$) can be used instead of $Z^s_t$ ($Z^s_{kt}$).
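The counting in (3) to (5) can be sketched for one realisation of the cycle times as follows. This is only an illustration: the function name, the dictionary-based data layout and the assumption that cycle times are whole days are ours, not part of the formulation.

```python
def presence_counts(arrival, duration, family, horizon):
    """Return (Z_kt, Z_t) for one realisation of the cycle times.

    arrival[j]  : arrival day s_j of train-set j (0 <= s_j < horizon)
    duration[j] : realised cycle time D_j, assumed to be an integer number of days
    family[j]   : family index k of train-set j
    """
    z_kt = {k: [0] * horizon for k in set(family.values())}
    for j, s_j in arrival.items():
        # Train-set j is present on day t iff s_j <= t and s_j + D_j >= t + 1.
        last_day = min(s_j + duration[j], horizon)
        for t in range(s_j, last_day):
            z_kt[family[j]][t] += 1
    # Total number of train-sets at the centre on each day.
    z_t = [sum(z_kt[k][t] for k in z_kt) for t in range(horizon)]
    return z_kt, z_t
```

With two train-sets arriving on days 0 and 2 and staying 3 and 2 days respectively, both are present on day 2 only, so the total count over a five-day horizon is [1, 1, 2, 1, 0].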


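Formulation of SOMSP


$$\min \; \alpha \sum_{j \in N} \sum_{t=0}^{T-1} c_{jt} x_{jt} + \beta \sum_{t=0}^{T-1} \Big( \delta_t\, E\big[(Z_t - C_t)^+\big] + \sum_{k \in K} \delta_{kt}\, E\big[(Z_{kt} - C_{kt})^+\big] \Big) \qquad (6)$$

subject to

$$\sum_{t=0}^{T-1} x_{jt} = 1, \quad \forall j \in N \qquad (7)$$

$$\sum_{k \in K} \sum_{j \in F^k} \sum_{s=\max(t-p_k+1,\,0)}^{t} x_{js} \le 1, \quad \forall t \in [0, T-1] \qquad (8)$$

(3), (4), (5)

$$x_{jt} \in \{0,1\}, \quad \forall j \in N,\ \forall t \in [0, T-1] \qquad (9)$$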

where $(a)^+ := \max(a, 0)$; $E[\cdot]$ denotes the expectation operator; $\alpha$ and $\beta$
are weights reflecting the relative importance of the two components of the
objective function. The objective function (6) minimises the weighted sum of
two components: the cost incurred if train-set $j$ arrives for maintenance on day
$t$ (i.e. the quadratic earliness and tardiness costs); and the expected penalties
for violating the limits $C_t$ and $C_{kt}$. Constraint (7) guarantees that each train-set
is scheduled for maintenance on a particular day $t$ within the planning horizon.
Constraint (8) describes the restriction on the first operation line: if a train-set
$j$ occupies the first operation line in a given time interval, other train-sets are
not allowed to arrive during this period. Constraint (9) states that the decision
variables are binary.
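Constraint (8) amounts to requiring that the intervals during which train-sets occupy the first operation line do not overlap. A minimal check for a given arrival plan might look as follows; the function name and data layout are hypothetical, and minimum durations are assumed to be at least one day.

```python
def first_line_feasible(arrival, family, min_duration):
    """Check constraint (8) for an arrival plan.

    Train-set j occupies the first operation line on days
    s_j, ..., s_j + p_k - 1, where k is its family; no two such
    intervals may share a day.
    """
    order = sorted(arrival, key=arrival.get)
    for prev, nxt in zip(order, order[1:]):
        # The next arrival must wait until the previous train-set has left the line.
        if arrival[nxt] < arrival[prev] + min_duration[family[prev]]:
            return False
    return True
```

Checking only consecutive arrivals is sufficient because the plan is sorted by arrival day, so any overlap between two train-sets implies an overlap between neighbours in that order.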


4 SAA Model 


Using the sample average approximation approach, the SOMSP formulation, 
presented in Sect. 3, can be rewritten as a mixed integer programming model. 

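Consider a set $\Omega := \{1, \dots, |\Omega|\}$ of scenarios. Each scenario comprises a vector
of realisations of the cycle times, which are drawn independently from the distributions
corresponding to each train family. Let $\xi_j^\omega$ be the cycle time of train-set $j$ in
scenario $\omega$.

Formulation of SAA Model


$$\text{(SAA)} \quad \min \; \alpha \sum_{j \in N} \sum_{t=0}^{T-1} c_{jt} x_{jt} + \beta \frac{1}{|\Omega|} \sum_{\omega \in \Omega} \sum_{t=0}^{T-1} \Big( \delta_t z_t^\omega + \sum_{k \in K} \delta_{kt} z_{kt}^\omega \Big) \qquad (10)$$

subject to

(7), (8), (9)

$$\sum_{j \in N} \sum_{s=\max(t-\xi_j^\omega+1,\,0)}^{t} x_{js} \le C_t + z_t^\omega, \quad \forall \omega \in \Omega,\ \forall t \in [0, T-1] \qquad (11)$$

$$\sum_{j \in F^k} \sum_{s=\max(t-\xi_j^\omega+1,\,0)}^{t} x_{js} \le C_{kt} + z_{kt}^\omega, \quad \forall \omega \in \Omega,\ \forall k \in K,\ \forall t \in [0, T-1] \qquad (12)$$

$$z_t^\omega \ge 0, \quad z_{kt}^\omega \ge 0, \quad \forall \omega \in \Omega,\ \forall k \in K,\ \forall t \in [0, T-1] \qquad (13)$$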


The objective function (10) is the weighted sum of two components: the first
component is the same as the first component of objective function (6), while the
second component is the sample average of the penalties for the additional
train-sets. We call the first and second components of objective function
(10) the earliness tardiness cost (ETC) and the resource violation cost (RVC),
respectively. For every scenario $\omega$, constraints (11) and (12) define the numbers of
additional train-sets, represented by the slack variables $z_t^\omega$ ($z_{kt}^\omega$), relative to the limits
$C_t$ and $C_{kt}$. Constraint (13) is the non-negativity constraint.
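For a fixed arrival plan, the optimal slack in (11) and (12) is simply the positive part of the excess over the limit, so objective (10) can be evaluated without a solver. The following sketch does this under simplifying assumptions of ours: capacities and penalties are constant over the days, the cost factors in (1) are both 1, and all names are illustrative.

```python
def saa_objective(arrival, window, family, scenarios, horizon,
                  cap, fam_cap, delta, fam_delta, alpha=1.0, beta=1.0):
    """Evaluate objective (10) for a fixed arrival plan.

    arrival[j]  : arrival day s_j
    window[j]   : due window (e_j, l_j)
    scenarios   : list of dicts, each mapping j to a sampled cycle time
    cap, delta  : centre capacity and its unit penalty (constant over days here)
    fam_cap[k], fam_delta[k] : family limit and its unit penalty (constant over days)
    """
    # Earliness tardiness cost (ETC), with lambda_1 = lambda_2 = 1.
    etc = sum(max(0, window[j][0] - s) ** 2 + max(0, s - window[j][1]) ** 2
              for j, s in arrival.items())

    # Resource violation cost (RVC): sample average of the penalised slacks.
    rvc = 0.0
    for xi in scenarios:
        for t in range(horizon):
            present = [j for j, s in arrival.items() if s <= t < s + xi[j]]
            rvc += delta * max(0, len(present) - cap)            # slack z_t
            for k in fam_cap:
                n_k = sum(1 for j in present if family[j] == k)
                rvc += fam_delta[k] * max(0, n_k - fam_cap[k])   # slack z_kt
    rvc /= len(scenarios)
    return alpha * etc + beta * rvc, etc, rvc
```

The function returns the weighted objective together with its ETC and RVC components, which makes it easy to see how the weight on RVC drives the trade-off.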


4.1 Property of SAA Model 


SAA has been extensively used in the literature to solve stochastic optimisation
problems (see, for example, [7]). However, solving the SAA model by itself can be
computationally challenging as the number of scenarios increases. To demonstrate
the complexity of the SAA model, we perform a preliminary experiment with
various numbers of scenarios. The results in terms of computation time are reported
in Table 1. In the same table, we also present the computation time required for
solving the SAA model if a sequence is given. Given a set of train-sets, a sequence is
the order in which the train-sets arrive at the maintenance centre. Let $E$ be a set
of arcs representing the precedence relations obtained from the sequence. The
SAA model with sequence (SAA-sequence) includes the following additional
constraint:

$$\sum_{t=0}^{T-1} t\, x_{it} \le \sum_{t=0}^{T-1} t\, x_{jt}, \quad \forall (i, j) \in E \qquad (14)$$

Constraint (14) enforces the precedence relations among train-sets according to
the given sequence.

Table 1. Computation time (in seconds) required for solving the SAA model with and
without sequence


Number of scenarios | SAA model | SAA-sequence model
5                   | 288.348   | 9.375
10                  | -         | 21.146
20                  | -         | 44.947
30                  | -         | 62.048
40                  | -         | 81.929
50                  | -         | 129.034
100                 | -         | 421.410


From Table 1, we note that the SAA model cannot be solved within the time limit
of one hour when the number of scenarios is 10 or more.
On the other hand, solving the SAA model with a given sequence is considerably easier:
a problem instance with 100 scenarios can be solved in less than 500 s.


5 The Solution Method 


Motivated by the result in Sect. 4, a genetic search with SAA (GS-SAA) is
proposed to solve the problem. The idea of GS-SAA is simple: it combines a genetic
algorithm for the global search over sequences with the SAA-sequence model, which is
solved to turn each sequence into a solution. The
genetic algorithm is inspired by [5] and has previously been studied in [2]. Details
of the proposed GS-SAA are given below.


Algorithm


1: Generate initial population
2: Decode the chromosomes into solutions by solving SAA-sequence model
3: Evaluate the solutions
4: for the number of generations < $IT_{max}$ and time < $T_{max}$ do
5:     Select parents
6:     Generate offspring (two-point crossover)
7:     Diversify offspring (mutation)
8:     Decode offspring into solutions by solving SAA-sequence model
9:     Evaluate the solutions
10: end for
11: return the best solution


The proposed GS-SAA starts with the generation of an initial population
of size $\mu$. Then, each chromosome in the initial population is decoded into a
sequence. The sequence decoding procedure works as follows. An arrival plan,
or a solution $P$, is represented by a chromosome. Each gene in the chromosome
corresponds to a train-set and the gene value is generated randomly from the
uniform distribution U(0,1). The sorted gene values determine the priority of the train
types. Respecting this priority list, the priority of train-sets (i.e. the sequence of
train-sets) is obtained by sorting the train-sets in the same train family by
their earliest desired arrival dates (see Fig. 1 for an example).


Train-set index         | 1      | 2       | 3       | 4       | 5
Train type              | V      | K       | T       | T       | T
Due window              | [0,28] | [19,47] | [30,58] | [34,62] | [41,69]
Chromosome              | 0.57   | 0.08    | 0.84    | 0.12    | 0.23
Sorted gene values      | 0.08   | 0.12    | 0.23    | 0.57    | 0.84
Priority of train types | K      | T       | T       | V       | T
Decoded sequence        | 2      | 3       | 4       | 1       | 5


Fig. 1. Example of sequence decoding by train types.
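The decoding illustrated in Fig. 1 can be sketched in a few lines. The function below is our own illustration of the described procedure (the name and the list-based layout are assumptions); it reproduces the decoded sequence of the example.

```python
def decode_sequence(genes, train_type, earliest):
    """Decode a random-key chromosome into an arrival sequence of train-sets.

    genes[i], train_type[i] and earliest[i] describe train-set i + 1.
    """
    n = len(genes)
    # Sorting the gene values gives the priority list of train types.
    by_gene = sorted(range(n), key=lambda i: genes[i])
    type_priority = [train_type[i] for i in by_gene]

    # Within each family, train-sets are ordered by earliest desired arrival date.
    queues = {}
    for i in sorted(range(n), key=lambda i: earliest[i]):
        queues.setdefault(train_type[i], []).append(i + 1)

    # Fill each slot of the priority list with the next train-set of that type.
    position = {k: 0 for k in queues}
    sequence = []
    for k in type_priority:
        sequence.append(queues[k][position[k]])
        position[k] += 1
    return sequence


genes = [0.57, 0.08, 0.84, 0.12, 0.23]
train_type = ["V", "K", "T", "T", "T"]
earliest = [0, 19, 30, 34, 41]
print(decode_sequence(genes, train_type, earliest))  # [2, 3, 4, 1, 5]
```

Because train-sets of the same family are always released in due-window order, two chromosomes that induce the same priority list of train types decode to the same sequence.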


Once the sequence is known, an arrival plan, or a solution P is generated by 
solving the SAA-sequence model. Next, the solution P is evaluated according to 
the objective function (6). The evaluation is the same as that given by [2]. 

Given the current generation, the next generation is produced through elite 
selection, crossover, and mutation. 

In the elite selection phase, some of the best individuals of the current
generation continue to exist in the next generation. The number of surviving
individuals is equal to $N_{elite}$. The elite selection strategy is motivated by the
“survival of the fittest” principle [10]. However, its drawback is that it can lead to
premature convergence (i.e. the search becomes trapped in a local optimum).

In the crossover phase, a child is produced by the two-point crossover
as in [2]. The two parents are randomly selected from a pool of parents. This
pool consists of the top $N_{elite}$ individuals of the current generation, and
the remaining candidates are chosen by Roulette Wheel Selection. Each
individual of the current generation is given a portion of the wheel. The
size of the portion is equal to the probability of selection, which is defined
as

$$\text{Probability of selection} = \frac{\text{fitness}}{\text{sum of fitness}}$$

where fitness is equal to the inverse of the value of the objective function (6),
and sum of fitness is equal to the total fitness of all individuals.
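A minimal sketch of the roulette wheel selection described above is given next. It assumes that all objective values are strictly positive; the function name and the optional `rng` argument (any object with a `random()` method) are ours.

```python
import random


def roulette_select(objective_values, rng=random):
    """Pick one index with probability fitness / sum of fitness."""
    fitness = [1.0 / v for v in objective_values]  # assumes all values are > 0
    total = sum(fitness)
    pick = rng.random() * total
    cumulative = 0.0
    for index, f in enumerate(fitness):
        cumulative += f
        if pick < cumulative:
            return index
    return len(fitness) - 1  # guard against floating-point round-off
```

Since fitness is the inverse of the objective value, an individual whose objective is half that of another receives a portion of the wheel twice as large.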

In the mutation phase, some gene values of the child chromosome are
replaced by a random number sampled from the uniform distribution U(0,1).
The replacement is triggered when a random number is less than the mutation
probability $P_m$. Details of the mutation process are discussed in [2].
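The two operators can be sketched as below. Since the exact variants are specified in [2] rather than here, the code assumes the textbook two-point crossover and a per-gene mutation test; both choices, as well as the function names, are assumptions made for illustration.

```python
import random


def two_point_crossover(parent_a, parent_b, rng=random):
    """Child copies parent_a, except for the segment between two cut points,
    which is taken from parent_b."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))
    return parent_a[:i] + parent_b[i:j] + parent_a[j:]


def mutate(chromosome, p_m=0.05, rng=random):
    """Resample each gene from U(0, 1) with probability p_m."""
    return [rng.random() if rng.random() < p_m else g for g in chromosome]
```

Because the chromosome is a vector of random keys, any crossover or mutation result is still a valid chromosome, so no repair step is needed before decoding.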


6 Computational Results 


The GS-SAA method was coded in Python 2 on an Intel Core i5-6500 CPU
@ 3.2 GHz with 8 GB of RAM. The SAA-sequence problem was solved using the
Branch-and-Cut method in IBM ILOG CPLEX 12.6.1 on the same computer.

We adopt the test problems in [2] where there are 3 train families with 35
train-sets in total. The planning horizon is set to one year. The parameters
associated with the train families are shown in Table 2. Furthermore, we set
$C_t = 5$, $\delta_t = 1$, $\lambda_1 = \lambda_2 = 1$, $\alpha = 1$, and $\beta = 1000$. For the setting of our genetic
algorithm, we have $\mu = 20$, $N_{elite} = 2$, $P_m = 0.05$, $IT_{max} = 40$, and $T_{max} = 3600$
(s). The information on the probability distributions of the cycle time by train
family is given in Table 3.


Table 2. Parameters for the train families


             |         |       | Norm                     | Special
Train family | $|F^k|$ | $p_k$ | $C_{kt}$ | $\delta_{kt}$ | $C_{kt}$ | $\delta_{kt}$
1            | 25      | 4     | 3        | 1             | 1        | 10
2            | 5       | 5     | 2        | 1             | 1        | 10
3            | 5       | 5     | 1        | 1             | 1        | 10


Norm refers to normal days and Special refers
to special days.

Table 3. Probability distribution information for cycle time by train family.


Train family | Minimum | Most likely | Maximum | Distribution
1            | 20      | 25          | 40      | beta-PERT
2            | 27      | 30          | 46      | beta-PERT
3            | 29      | 30          | 52      | beta-PERT

In using SAA, one question to ask is how many scenarios are required in
order to obtain a solution of good quality within acceptable computation time.
To answer this question, we extend the preliminary experiment in Sect. 4.1 for the
SAA-sequence model such that both the solution quality and the computation time
are presented. The SAA-sequence model is used to solve the test problems with
5, 10, 20, 30, 40, 50, and 100 scenarios. The scenarios are randomly generated.
The average objective value and average computation time obtained after 5 runs
are reported in Table 4.

From Table 4, it can be seen that the computation time increases significantly
as the number of scenarios increases. As a good trade-off between solution
quality and computation time, SAA-sequence with up to 10 scenarios is sufficient.
Therefore, we set $|\Omega| = 1, 5, 10$.

Table 4. Average results for 5 runs of SAA-sequence for different numbers of scenarios


Number of scenarios | Computation time (s) | Objective value | ETC    | RVC
5                   | 9.375                | 206,760         | 22,447 | 184.318
10                  | 21.146               | 205,086         | 22,613 | 182.473
20                  | 44.947               | 199,781         | 22,743 | 177.038
30                  | 62.048               | 199,046         | 22,743 | 176.030
40                  | 81.929               | 199,046         | 22,743 | 176.030
50                  | 129.034              | 199,046         | 22,743 | 176.030
100                 | 421.410              | 199,046         | 22,743 | 176.030


Table 5 shows the performance of our GS-SAA method with 1, 5, and 10
scenarios, in comparison with the method proposed in [2] (denoted as MIM)
and with solving the SAA model by itself with 5 scenarios (denoted as SAA-5). For the
GS-SAA and MIM solution approaches, the best solution of the initial population is
reported under the column titled “In.”, while the best solution of the final
population is reported under the column titled “obj.”. For the SAA-5 solution approach,
the optimal solution obtained by CPLEX is reported under the column titled
“CPLEX obj.”. For each solution approach, 5 runs of the experiment are performed;
the average, maximum and minimum of the objective values are also shown in
Table 5.

On average, GS-SAA with 10 scenarios produces the best solution for the
initial population. This result is not surprising since the quality of the
approximation of the objective function (6) by its sample average increases with
the number of scenarios, as demonstrated in Table 4.

To test the impact of the number of scenarios on the performance of GS-SAA,
we consider the relative improvement in the objective function value over the best
solution of the initial population, which is calculated as (In. − obj.)/In. ×
100%. On average, the relative improvement in the objective function value
over the best solution of the initial population is approximately 27.5%, 28.8%,
19.9%, respectively, for 1, 5, 10 scenarios. The small relative improvement of
GS-SAA with 10 scenarios is due to the amount of time required for solving the
SAA-sequence model; as a result, only a few generations are explored and the
search capability of the genetic algorithm is not fully employed. This observation
suggests that GS-SAA with 10 scenarios can potentially give a better solution if
it is allowed to run for a longer period of time.

Furthermore, as shown in Table 5, the performance of GS-SAA is consistently
better than that of MIM irrespective of the number of scenarios used. The method used
in MIM to transform a sequence into a feasible arrival plan is just a simple greedy
heuristic and it suffers from poor performance as the idle time inserted between
train-sets is not optimal. On the other hand, given the sequence, GS-SAA
generates an optimal arrival plan by solving the corresponding SAA-sequence model
using CPLEX.

Comparing GS-SAA with SAA-5, it can be noted that the latter outperforms
GS-SAA in all runs of the experiment. This reveals that a good sequence is crucial
to obtaining a good solution. However, as demonstrated in Table 1, solving the SAA
model by itself is computationally challenging if the number of scenarios increases
or the problem size becomes large (for the problem size in this paper, CPLEX cannot
obtain an optimal solution within 1 h with 10 or more scenarios). This shows the
limitation of the SAA model in large applications.

Table 5. Comparison of the performance of the different solution approaches (solution
quality).


        | MIM                   | GS-SAA, 1 scenario    | GS-SAA, 5 scenarios   | GS-SAA, 10 scenarios  | SAA-5
Run     | In.       | obj.      | In.       | obj.      | In.       | obj.      | In.       | obj.      | CPLEX obj.
1       | 607,466   | 475,601   | 343,337   | 248,974   | 290,096   | 247,636   | 331,497   | 230,787   | 218,949
2       | 533,746   | 475,601   | 340,894   | 227,894   | 344,851   | 238,421   | 330,219   | 246,854   | 226,197
3       | 624,286   | 475,601   | 379,608   | 242,879   | 370,441   | 238,200   | 314,281   | 280,038   | 213,755
4       | 629,515   | 475,601   | 344,171   | 331,267   | 338,932   | 234,148   | 331,583   | 282,617   | 212,428
5       | 593,304   | 475,601   | 320,719   | 300,493   | 345,213   | 243,890   | 348,886   | 286,488   | 216,267
Average | 597,663   | 475,601   | 345,745   | 270,301   | 337,906   | 240,495   | 331,293   | 265,357   | 217,519
Max     | 629,515   | 475,601   | 379,607   | 331,267   | 370,441   | 247,636   | 348,886   | 286,488   | 226,197
Min     | 533,746   | 475,601   | 320,719   | 227,894   | 290,096   | 234,148   | 314,281   | 230,787   | 212,428


In. is the best solution of the initial population.
obj. is the objective value. CPLEX obj. is the optimal solution obtained by CPLEX.


Table 6 shows the computation times in seconds obtained by MIM, GS-SAA
with 1, 5 and 10 scenarios, and SAA-5. On average, SAA-5 can be solved to
optimality in 288.35 s, whereas MIM takes 1288.55 s, and GS-SAA
takes 1972.81, 3844.88, 4137.58 s, respectively, for 1, 5, 10 scenarios. Both MIM
and GS-SAA require significant computation time because both methods
terminate only when the predefined maximum number of generations or the time
limit is reached.


Table 6. Comparison of the performance of the different solution approaches
(computation time).


        |         | GS-SAA
Run     | MIM     | 1 scenario | 5 scenarios | 10 scenarios | SAA-5
1       | 1263.52 | 2053.31    | 3796.17     | 4132.10      | 137.55
2       | 1255.65 | 2036.93    | 3921.30     | 4104.96      | 202.65
3       | 1439.97 | 1903.90    | 3803.61     | 4051.59      | 135.22
4       | 888.78  | 1881.32    | 3829.91     | 4097.07      | 254.10
5       | 1594.85 | 1988.57    | 3873.41     | 4300.58      | 712.23
Average | 1288.55 | 1972.81    | 3844.88     | 4137.58      | 288.35
Max     | 1594.85 | 2053.31    | 3921.30     | 4300.58      | 712.23
Min     | 888.78  | 1881.32    | 3796.18     | 4051.59      | 135.22


7 Conclusions


This paper examines the overhaul maintenance scheduling problem with stochastic
cycle time. We present the mathematical formulation of SOMSP and show
how it can be formulated as a mixed integer programming model using the sample
average approximation approach. A combined genetic search with SAA is
developed to solve the problem. The results show that solving the SAA model
by itself is computationally challenging as the number of scenarios becomes large.
However, if a sequence of train-sets is given, the computation time decreases
significantly. Computational results reveal that the proposed GS-SAA can produce
good solutions within acceptable time for a test problem consisting of 35 train-sets
and a planning horizon of one year.

In conclusion, SAA can be used for small instances since CPLEX can obtain an
optimal solution within acceptable computation time. For larger applications,
GS-SAA is the preferred choice. It is noted that the performance of GS-SAA
depends on having both a good sequence and good inserted idle time. Future
work can investigate methods to further reduce the computation time of the
SAA-sequence model so that the search capability of the genetic algorithm can be
fully exploited.


Abstract. Missing data frequently occur in many applied domains and
pose serious problems such as loss of efficiency and unreliable results
for various approaches. Many real applications require complete data;
thus, filling in the missing values is a mandatory pre-processing
step. DTWBI is a previously proposed method to estimate missing data
in univariate time series with recurrent data. This paper introduces an
extension of DTWBI, namely eDTWBI. Firstly, we simultaneously find
the two most similar windows to the sub-sequences before and after a
gap using DTWBI. Secondly, we impute the gap with the average of the
sub-sequence following the window most similar to the query before the gap
and the sub-sequence preceding the window most similar to the query after
the gap. Experimental results on three datasets show that our approach
outperforms seven related methods for time series carrying effective
information.


Keywords: Imputation - Missing data - Univariate time series - 
Dynamic time warping - Similarity 


1 Introduction 


A great deal of useful information can be extracted from collected time series, which
are used in different domains such as economics [24], finance [3], health-care
[7], meteorology [4,19] and traffic engineering [16]. However, the collected data are
usually incomplete for various reasons, such as sensor errors, transmission problems,
incorrect measurements, bad weather conditions (for outdoor sensors), manual
maintenance, etc. Missing data can lead to inaccurate data interpretation and to biased
and unreliable results [8]. Moreover, most of the models proposed for time series
analysis suffer from one major drawback: despite their powerful techniques, they are
unable to process incomplete datasets. An easy way out is to delete
or ignore missing data, but this solution comes at a high price because valuable
information is lost, especially for time series, where the considered values depend
on the past ones. So, replacing missing data is a mandatory pre-processing
task. Imputation is a conventional way to handle
this problem [11].

Imputation methods can be categorized into two types: (1) multivariate
imputation techniques and (2) univariate imputation approaches. Techniques of the
first type take advantage of relations between variables to estimate missing
data [6,7,21,22]. These methods handle incomplete data by filling in missing
features based on observable ones. They usually train separate models, such as
missForest [21], ELM (extreme learning machines) [20], MLP (multi-layer
perceptron) [10], etc., for estimating the unobserved attributes.

However, when dealing with missing data in univariate time series, we can
only exploit the available observations of this single variable to predict the missing
values. Moritz et al. pointed out that this task is particularly challenging [13]. They
also reviewed various methods for univariate time series and showed the
limitations of some approaches in [13]. Few studies investigate how to fill
missing data in univariate time series. Simple methods are often used, such as mean [1],
median [5], locf (last observation carried forward), linear interpolation or spline
interpolation. In the interpolation methods, missing data are estimated from the
preceding and succeeding values of the univariate time series. These techniques
are effective when the missing data are isolated (one missing point) or form a small
gap. But when the gap is large, i.e., contains many consecutive missing values, they do
not give good results. For example, if the missing segment has a sine wave shape, linear
interpolation would fill the gap with a straight line. In addition, statistical
methods (e.g. ARMA or ARIMA) can also be used to complete missing data in
univariate time series, but these models require linear data after differencing [2].

Therefore, it is necessary to propose effective imputation methods for uni- 
variate time series and consider the characteristics of data, especially for complex 
distribution data. 

In our previous study [15], we proposed the DTWBI approach, which makes it possible
to impute large runs of consecutive missing values in univariate time series. DTWBI
is based on the combination of the shape-feature extraction algorithm [14] and the
Dynamic Time Warping method [17]. In this study, a gap is considered large when the
number of consecutive missing points is larger than the known-process change, so
the definition depends on the application. In order to improve imputation ability, we
introduce a novel and effective method for univariate time series, namely eDTWBI,
which is an extension of DTWBI. Besides, we compare the proposed method
with a heuristic approach (called the Random method) and study how well all the
considered methods conserve frequency information after the imputation.

This paper is organized as follows. Section 2 focuses on the proposed method. 
Next, Sect. 3 introduces our experiments, results and discussion. Finally, conclu- 
sions are drawn and future work is presented. 


2 The Proposed Method: eDTWBI 


In the DTWBI algorithm we only envisaged one query, either before or after the
considered gap. In this study, we modify DTWBI by taking into account two
queries, one before and one after the gap. Moreover, the data before
and the data after the gap are treated as two reference univariate time series.
This, on the one hand, enriches the learning base and, consequently, increases
the prediction ability of the method. On the other hand, it makes it possible to
take into account the dynamics (a key factor) of the data before and after the
considered gap when estimating the imputation values, and to relax the temporal
constraints between the two queries.


[Figure: panels include "Query before the gap" and "Final imputed values".]


Fig. 1. Scheme of eDTWBI for the imputation task: 1-Queries building, 2-Sliding
windows comparison, 3-Most similar windows selection, 4-Gap filling


The implementation of eDTWBI depends on the position of the gap in the series.
First, if the position of a gap (with size T) is within the first 2 × T points of the
series, only DTWBI is applied, on the remaining data after the gap. If the gap position
is within the last 2 × T points of the series, only DTWBI is performed, on the data
before the gap. When the missing data are located in the middle of the series, i.e.
between 2 × T and N − 2 × T (where N is the length of the time series), eDTWBI is
applied to impute the missing data. Figure 1 illustrates the mechanism of eDTWBI for
filling large gaps in univariate time series, and the detailed algorithm is described in
Algorithm 1. This approach consists of three main phases, described as follows:

The first phase - Queries building (cf. 1 in Fig. 1): For each T-gap, two
reference time series are extracted from the original signal and two queries are
created. The data before the gap (denoted Db) and the data after the gap (denoted
Da) are treated as two separate time series. We denote by Qb the sub-sequence
immediately before the gap and by Qa the sub-sequence immediately after the gap.
The queries Qa and Qb have the same size T as the gap.

The second phase - Retrieving the most similar windows (cf. 2 & 3 in
Fig. 1): For the Da database, sliding reference windows (denoted R) of size T
are built. Among these R windows, we use the DTWBI method [15] to find the window
most similar to Qa (denoted Qas). The same process is carried out to retrieve the
window Qbs most similar to Qb in the Db data.
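The window search of the second phase can be sketched as follows. This is a simplified illustration, not the DTWBI procedure of [15]: it uses the classic DTW recurrence with an absolute-difference local cost and scans all windows, omitting the shape-feature (cosine) filter and the threshold step. The function names and the `exclude_start` parameter are ours.

```python
def dtw_cost(a, b):
    """Classic dynamic time warping cost with absolute-difference local distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def most_similar_window(data, query, step=1, exclude_start=None):
    """Slide a window of len(query) over data; return (start, cost) of the best match."""
    size = len(query)
    best_start, best_cost = None, float("inf")
    for start in range(0, len(data) - size + 1, step):
        if start == exclude_start:
            continue  # do not match the query against itself
        cost = dtw_cost(query, data[start:start + size])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start, best_cost
```

Since the query is itself part of the reference data, `exclude_start` lets the caller skip the query's own position, which would otherwise always win with a cost of zero.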

A key point of the eDTWBI approach is to take into account the dynamics and shape
of the data before and after a gap. This means that two queries, one before and one
after the studied gap, are considered. This makes it possible to detect the windows
whose dynamics and shape are most similar to the queries.


Algorithm 1. eDTWBI algorithm
Input: X = {x_1, x_2, ..., x_N}: incomplete time series
       t: index of a gap (position of the first missing value of the gap)
       T: size of the gap
       θ_cos: cosine threshold (< 1)
       step_threshold: increment for finding a threshold
       step_sim_win: increment for finding a similar window
Output: Y - completed (imputed) time series

1: for each gap in X do
2:   Step 1: Divide X into two separated time series Da, Db: Da = X[t+T : N], Db = X[1 : t-1]
3:   Step 2: Construct queries Qa, Qb - temporal windows after and before the missing
     data: Qa = Da[1 : T]; Qb = Db[t-T : t-1]
4:   for Db data do
5:     Step 3: Find the threshold on the Db data
6:     i ← 1; DTW_costs ← NULL
7:     while i <= length(Db) do
8:       k ← i + T - 1
9:       Create a reference window: R(i) = Db[i : k]
10:      Calculate global feature of Qb and R(i): gfQb, gfR
11:      Compute cosine coefficient: cos = cosine(gfQb, gfR)
12:      if cos > θ_cos then
13:        Calculate DTW cost: cost = DTW_cost(Qb, R(i))
14:        Save the cost to DTW_costs
15:      i ← i + step_threshold
16:    threshold = min{DTW_costs}
17:    Step 4: Find similar windows on the Db data
18:    i ← 1; Lopb ← NULL
19:    while i <= length(Db) do
20:      k ← i + T - 1
21:      Create a reference window: R(i) = Db[i : k]
22:      Calculate global feature of Qb and R(i): gfQb, gfR
23:      Compute cosine coefficient: cos = cosine(gfQb, gfR)
24:      if cos > θ_cos then
25:        Calculate DTW cost: cost = DTW_cost(Qb, R(i))
26:        if cost < threshold then
27:          Save position of R(i) to Lopb
28:      i ← i + step_sim_win
29:    return Qbs - the most similar window to Qb having the minimum DTW cost in
       the Lopb list
30:  for Da data do
31:    Perform steps 3 and 4 with Da data
32:    return Qas - the most similar window to Qa
33:  Step 5: Replace the missing values at the position t by the average vector of the
     window after Qbs and the window preceding Qas

The third phase - Completing the gap (cf. 4 in Fig. 1): When the two most
similar windows are found, we impute the gap by averaging the values of the window
preceding Qas and the window following Qbs. In the eDTWBI approach,
average values are used because Schomaker and Heumann indicated that model
averaging makes the final results more stable and unbiased [18].
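The gap-filling step can be illustrated as follows, assuming the positions of Qbs and Qas have already been found. Indices are 0-based here (unlike Algorithm 1), the function name is ours, and the caller is assumed to pass windows whose neighbours contain no missing values.

```python
def fill_gap(series, t, size, qbs_start, qas_start):
    """Impute series[t:t+size] (0-based) from the two most similar windows.

    qbs_start : start index in `series` of Qbs, the best match for the query
                before the gap
    qas_start : start index in `series` of Qas, the best match for the query
                after the gap
    """
    if qas_start < size or qbs_start + 2 * size > len(series):
        raise ValueError("neighbouring windows fall outside the series")
    after_qbs = series[qbs_start + size:qbs_start + 2 * size]   # window following Qbs
    before_qas = series[qas_start - size:qas_start]             # window preceding Qas
    filled = list(series)
    filled[t:t + size] = [(u + v) / 2.0 for u, v in zip(after_qbs, before_qas)]
    return filled
```

The intuition is that what followed the best match of Qb, and what preceded the best match of Qa, are both plausible candidates for the missing segment; averaging them combines the two estimates.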


3 Experiments 


To illustrate the performance of the proposed method, we evaluate it and compare it
with other imputation methods, including DTWBI [15], Kalman [12], na.interp
[9], na.locf, na.aggregate and na.spline [25], and a heuristic (Random) method. For
the last of these, we randomly choose 10 windows having the same size as the
gap, then compute their average values to fill in the gap.
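The Random baseline is simple enough to sketch directly. The version below assumes that windows are drawn with replacement from positions that do not overlap the gap, and that the series has no other missing values; these details, like the function name, are our assumptions.

```python
import random


def random_baseline(series, t, size, n_windows=10, seed=0):
    """Fill series[t:t+size] with the point-wise average of randomly chosen windows."""
    rng = random.Random(seed)
    # Candidate windows must not overlap the gap.
    starts = [s for s in range(len(series) - size + 1)
              if s + size <= t or s >= t + size]
    chosen = [rng.choice(starts) for _ in range(n_windows)]
    filled = list(series)
    filled[t:t + size] = [sum(series[s + i] for s in chosen) / n_windows
                          for i in range(size)]
    return filled
```

Because the windows are chosen without regard to shape or phase, averaging them tends to flatten seasonal patterns, which is what makes this a useful lower reference for similarity-based methods.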


3.1 Data Description 


Four time series are used in the experiments: monthly mean CO2 concentrations [23], daily mean air temperature at the Cua Ong meteorological station, and monthly mean air temperature and humidity at the Phu Lien meteorological station in Vietnam. To obtain useful information from the datasets and to make them easily exploitable, we analyzed these series. Table 1 summarizes their characteristics. These datasets have a seasonality component (i.e. an annual cycle) without any linear trend. The seasonality component should be preserved after imputation, although the series do not have a regular amplitude.


Table 1. Characteristics of time series

No | Dataset name             | Period    | #Samples | Seasonality (Y/N) | Trend (Y/N) | Frequency
1  | CO2 concentrations       | 1974-1987 | 160      | Y                 | N           | Monthly
2  | Phu Lien humidity        | 1961-2015 | 692      | Y                 | N           | Monthly
3  | Phu Lien air temperature | 1961-2014 | 684      | Y                 | N           | Monthly
4  | Cua Ong air temperature  | 1973-1999 | 9859     | Y                 | N           | Daily


1. CO2 concentrations - This dataset contains monthly mean CO2 concentrations at the Mauna Loa Observatory from 1974 to 1987 [23].

2. Phu Lien humidity - This dataset, containing monthly mean air humidity at 
the Phu Lien meteorological station in Vietnam, was collected from 1/1961 
to 8/2015. 

3. Phu Lien air temperature - This dataset is composed of monthly mean air 
temperature at the Phu Lien meteorological station in Vietnam from 1/1961 
to 12/2014. 
4. Cua Ong temperature - This dataset contains daily mean air temperature at the Cua Ong meteorological station in Vietnam from 1/1/1973 to 31/12/1999.


3.2 Experiment Process


The performance of imputation methods cannot be assessed directly on real gaps because the true values are missing. Thus, we must generate artificial missing data on complete time series in order to compare the imputation methods. The experiments follow a three-step procedure:


- The first step: Simulated missing data are produced by deleting segments of consecutive values of different sizes from each time series.

- The second step: All imputation algorithms are applied to estimate the missing values.

- The third step: The true values and the imputed data (generated by the approaches mentioned above) are compared.


Here, 5 missing data levels are considered on the 4 datasets. For the CO2 and Phu Lien series, the gap sizes are 6%, 7.5%, 10%, 12.5% and 15% of the series length. The Cua Ong series is a considerably larger dataset, so its gaps are created with sizes of 3%, 3.75%, 5%, 6.25% and 7.5% of the dataset length (the largest gap in this series has 739 missing points, i.e. more than 2 years of missing data).

For each missing rate in a dataset, 10 missing positions are randomly chosen and all the algorithms are run.
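The first step of this protocol, creating one artificial gap of consecutive values, might look like the sketch below. The function name `make_gap`, the rounding of the gap size and the uniform choice of the gap position are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def make_gap(series, rate, rng):
    """Delete one block of consecutive values covering `rate` of the series.

    Returns the series with NaNs in the gap, the gap start index,
    and the removed true values (kept for later comparison).
    """
    x = np.asarray(series, dtype=float).copy()
    size = int(round(rate * len(x)))                  # assumed rounding rule
    start = int(rng.integers(0, len(x) - size + 1))   # assumed uniform position
    truth = x[start:start + size].copy()
    x[start:start + size] = np.nan
    return x, start, truth
```

With a series of 160 samples and a rate of 12.5%, this removes 20 consecutive values, which matches the gap size quoted for the CO2 series. A real experiment would also need to keep enough data on both sides of the gap for the window search.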


3.3 Imputation Performance Indicators


After completing the missing data, the experimental results are discussed in two parts, namely quantitative performance and visual performance. Specifically, the quantitative performance is analyzed with amplitude, shape and frequency criteria. To compare the amplitude of the imputed values with the actual ones, we use Similarity (Sim), an adapted Normalized Mean Absolute Error (NMAE) and Root Mean Square Error (RMSE). Fractional Bias (FB) is applied to compare the shape of the predicted data with the real data. To assess how well the imputation methods preserve frequency, we compare the seasonality components of the full series and the imputed signal using NMAE (denoted NMAE(s)). These indicators are computed as follows:


1. Similarity - defines the percentage of similarity between the imputed values (Y) and the respective true values (X). It is calculated by:

$$\mathrm{Sim}(Y, X) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{1 + \frac{|y_i - x_i|}{\max(X) - \min(X)}} \qquad (1)$$

where T is the number of missing values. A higher similarity (∈ [0, 1]) indicates a better ability to complete missing values.
2. NMAE, the Normalized Mean Absolute Error between the imputed values Y and the respective true values X, is computed as:

$$\mathrm{NMAE}(Y, X) = \frac{1}{T} \sum_{i=1}^{T} \frac{|y_i - x_i|}{V_{\max} - V_{\min}} \qquad (2)$$

where $V_{\max}$ and $V_{\min}$ are the maximum and the minimum values of the original time series. A lower NMAE means a better-performing method for the imputation task.

3. RMSE: The Root Mean Square Error is defined as the square root of the average squared difference between the imputed values Y and the respective true values X. This indicator is very useful for measuring overall precision or accuracy. In general, the more effective method has a lower RMSE.

$$\mathrm{RMSE}(Y, X) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (y_i - x_i)^2}$$

4. FB (Fractional Bias) is defined by:

$$\mathrm{FB}(Y, X) = 2 \cdot \frac{\mathrm{mean}(Y) - \mathrm{mean}(X)}{\mathrm{mean}(Y) + \mathrm{mean}(X)}$$

A model is considered perfect when its FB tends to 0.
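The four indicators above translate directly into a few lines of NumPy. The sketch below follows the definitions as stated in this section; the function names are illustrative, and `nmae` takes the extrema of the original series as explicit arguments because they are not computed from the gap values alone.

```python
import numpy as np

def similarity(y, x):
    """Sim(Y, X): mean of 1 / (1 + |y_i - x_i| / (max(X) - min(X)))."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    spread = x.max() - x.min()          # assumes the true values are not constant
    return float(np.mean(1.0 / (1.0 + np.abs(y - x) / spread)))

def nmae(y, x, v_max, v_min):
    """NMAE(Y, X): mean absolute error scaled by the range of the original series."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.mean(np.abs(y - x)) / (v_max - v_min))

def rmse(y, x):
    """RMSE(Y, X): square root of the mean squared difference."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean((y - x) ** 2)))

def fractional_bias(y, x):
    """FB(Y, X) = 2 * (mean(Y) - mean(X)) / (mean(Y) + mean(X))."""
    my, mx = float(np.mean(y)), float(np.mean(x))
    return 2.0 * (my - mx) / (my + mx)
```

For example, with true values X = (0, 1, 2) and imputed values Y = (0, 1, 3), the definitions give Sim = 8/9, NMAE = 1/6 (taking the range of the original series to be 2), RMSE = sqrt(1/3) and FB = 2/7.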


3.4 Experimental Results


This part presents the experimental results obtained with the proposed approach and compares it with the seven published methods.

Tables 2 and 3 show the averaged performance of the different imputation methods on the 4 datasets for the 5 indicators mentioned previously. These results confirm that eDTWBI is more effective than the compared methods in most cases, especially in scenarios with relatively high missing rates.

On the CO2 and Phu Lien temperature series, eDTWBI provides the highest Similarity and the lowest NMAE, RMSE and NMAE(s) at nearly every missing ratio (the exception being NMAE(s) at 7.5% on the CO2 series). The results demonstrate that the values imputed by the eDTWBI method are close to the real values, especially for large gaps (12.5% and 15% on the Phu Lien temperature and CO2 series). In addition, our method is capable of preserving frequency, as shown by the NMAE(s) index. FB is a quantitative index that allows a shape comparison between predicted and true values. In Tables 2 and 3, the FB values of eDTWBI are the smallest for the majority of missing rates; they rank second at some levels (10% and 12.5% on the CO2 series, 15% on Phu Lien temperature) and third at 7.5%, 10% and 15% on the Phu Lien humidity signal. Again, the results show the improved ability of the eDTWBI method to estimate missing values in terms of shape.
Table 2. Average imputation performance indices of various imputation algorithms on CO2 and Phu Lien temperature datasets

Method | Gap size | CO2: Sim, NMAE, RMSE, FB, NMAE(s) | Phu Lien temperature: Sim, NMAE, RMSE, FB, NMAE(s)

Random |6% 0.625 0.196 5.22 0.013 0.03 0.779 0.237 4.82 0.018 0.021 
Kalman 0.754 0.097 2.768 0.006 0.019 0.58 0.733 14.924 0.48 0.02 
DTWBI 0.832 0.055 1.509 0.004 0.009 0.878 0.114 2.576 0.023 0.009 
eDTWBI 0.919 0.024 0.693 0.001 0.006 0.916 0.075 1.7 0.01 0.005 
na.interp 0.731 0.106 2.973 0.006 0.022 0.778 0.244 5.28 0.062 0.02 
na.locf 0.721 0.114 3.22 0.006 0.024 0.775 0.257 5.718 0.15 0.019 
Aggregate 0.636 0.18 4.802 0.012 0.028 0.791 0.216 4.379 0.016 0.019 
na.spline 0.764 0.092 2.66 0.006 0.019 0.599 0.694 14.379 0.433 0.02 
Random | 7.5% 0.671 0.153 4.042 0.009 0.022 0.798 0.227 4.607 0.014 0.026 
Kalman 0.726 0.126 3.607 0.008 0.024 0.534 0.993 19.84 1.273 0.027 
DTWBI 0.798 0.068 1.731 0.004 0.008 0.883 0.119 2.631 0.022 0.012 
eDTWBI 0.889 0.034 0.924 0.001 0.01 0.913 0.086 1.983 0.011 0.008 
na.interp 0.737 0.105 3.026 0.005 0.026 0.772 0.281 6.071 0.144 0.026 
na.locf 0.725 0.115 3.359 0.008 0.024 0.776 0.273 5.8 0.152 0.025 
Aggregate 0.681 0.14 3.846 0.009 0.024 0.797 0.228 4.605 0.013 0.026 
na.spline 0.741 0.117 3.414 0.008 0.023 0.547 0.957 19.432 1.241 0.027 
Random | 10% 0.644 0.196 5.145 0.013 0.041 0.797 0.236 4.802 0.013 0.035 
Kalman 0.71 0.15 4.572 0.009 0.033 0.484 1.3 26.479 1.58 0.035 
DTWBI 0.735 0.122 3.595 0.009 0.035 0.885 0.12 2.691 0.021 0.016 
eDTWBI 0.804 0.082 2.271 0.004 0.025 0.912 0.089 2.065 0.009 0.011 
na.interp 0.777 0.09 2.54 0.002 0.031 0.787 0.255 5.395 0.029 0.035 
na.locf 0.74 0.114 3.202 0.006 0.03 0.775 0.293 6.49 0.189 0.034 
Aggregate 0.629 0.204 5.344 0.014 0.038 0.799 0.23 4.644 0.014 0.034 
na.spline 0.658 0.591 19.906 0.046 0.043 0.475 1.322 27.19 3.809 0.035 
Random | 12.5% 0.634 0.199 5.262 0.014 0.047 0.797 0.234 4.756 0.009 0.043 
Kalman 0.761 0.107 3.238 0.003 0.041 0.622 0.722 15.389 0.372 0.042 
DTWBI 0.731 0.122 3.508 0.008 0.036 0.879 0.128 2.835 0.013 0.021 
eDTWBI 0.804 0.083 2.43 0.005 0.031 0.901 0.101 2.304 0.008 0.016 
na.interp 0.767 0.1 2.886 0.006 0.044 0.78 0.282 6.263 0.171 0.042 
na.locf 0.744 0.117 3.338 0.007 0.048 0.763 0.315 6.95 0.229 0.043 
Aggregate 0.63 0.206 5.446 0.014 0.043 0.803 0.225 4.547 0.009 0.042 
na.spline 0.756 0.113 3.533 0.005 0.041 0.537 1.112 23.34 1.149 0.042 
Random 15% 0.674 0.199 5.308 0.014 0.047 0.798 0.234 4.731 0.007 0.052 
Kalman 0.651 0.233 6.319 0.015 0.053 0.45 1.45 29.078 9.551 0.052 
DTWBI 0.747 0.124 3.313 0.008 0.033 0.886 0.12 2.684 0.012 0.023 
eDTWBI 0.831 0.082 2.297 0.005 0.031 0.897 0.107 2.388 0.008 0.021 
na.interp 0.771 0.115 3.311 0.007 0.05 0.782 0.277 6.192 0.149 0.051 
na.locf 0.744 0.135 3.794 0.008 0.05 0.777 0.288 6.404 0.191 0.05 
Aggregate 0.699 0.176 4.778 0.011 0.05 0.801 0.227 4.585 0.008 0.05 
na.spline 0.662 0.223 6.14 0.015 0.051 0.527 1.097 22.954 1.278 0.051 


The Cua Ong time series is long, so we pay special attention to the shape and dynamics of the imputed values. This is very important when filling in large gaps. Therefore, we take into account another index, FA2. It represents the fraction of data points that satisfy the smoothing amplitude cover $0.5 \le Y/X \le 2$. This indicator is calculated as

$$\mathrm{FA2}(Y, X) = \frac{\mathrm{length}(0.5 \le Y/X \le 2)}{\mathrm{length}(X)}.$$

For the imputation task, if FA2 is closer to 1, the imputation values are closer to the real values. When
Table 3. Average imputation performance indices of various imputation algorithms on Phu Lien humidity and Cua Ong series

Method | Gap size | Phu Lien humidity: Sim, NMAE, RMSE, FB, NMAE(s) | Cua Ong temperature: Sim, NMAE, RMSE, FB, FA2
Random 6% 0.858 0.135 6.8 0.021 0.021 | 0.83 0.18 54.7 0.042 0.98 
Kalman 0.845 0.153 7A 0.041 0.019 | 0.83 0.19 58.6 0.152 0.97 
DTWBI 0.861 0.132 6.2 0.023 0.011 | 0.901 0.10 33.3 0.022 0.993 
eDTWBI 0.877 0.114 5.6 0.018 0.011 | 0.906 0.09 31.3 0.028 1.00 
na.interp 0.828 0.176 8.3 0.054 0.02 0.83 0.19 58.6 0.152 0.97 
na.locf 0.786 0.236 10.4 0.096 0.021 | 0.80 0.23 72.2 0.18 0.95 
Aggregate 0.865 0.126 6.3 0.019 0.019 | 0.82 0.19 56.3 0.042 0.98 
na.spline 0.534 0.908 37.4 0.4 0.02 0.39 2.43 727.6 2.077 0.21 
Random 7.5% 0.84 0.125 5.9 0.02 0.019 | 0.84 0.18 53.2 0.014 0.98 
Kalman 0.834 0.132 6.0 0.026 0.02 0.81 0.22 68.5 0.145 0.95 
DTWBI 0.851 0.118 5.7 0.016 0.011 | 0.89 0.10 34.7 0.016 0.99 
eDTWBI 0.859 0.11 5.3 0.022 0.01 0.912 0.09 30.2 0.02 0.994 
na.interp 0.844 0.125 5.9 0.022 0.02 0.81 0.22 68.5 0.145 0.95 
na.locf 0.821 0.149 6.9 0.047 0.019 | 0.81 0.21 67.4 0.154 0.96 
Aggregate 0.84 0.124 5.8 0.02 0.019 | 0.83 0.18 54.1 0.014 0.98 
na.spline 0.459 1.193 51.9 0.437 0.02 0.31 3.91 1153.9 2.256 0.15 
Random 10% 0.865 0.124 6.1 0.008 0.031 | 0.83 0.19 57.8 0.052 0.98 
Kalman 0.85 0.143 6.9 0.037 0.031 | 0.83 0.20 62.1 0.056 0.97 
DTWBI 0.859 0.134 6.8 0.015 0.025 | 0.903 0.10 32.6 0.025 0.99 
eDTWBI 0.864 0.127 6.3 0.012 0.023 | 0.912 0.09 28.7 0.022 0.999 
na.interp 0.84 0.155 7.3 0.047 0.031 | 0.83 0.20 62.1 0.056 0.97 
na.locf 0.834 0.163 7.6 0.05 0.031 | 0.83 0.20 64.1 0.134 0.97 
aggregate 0.872 0.116 5.8 0.008 0.03 0.83 0.18 54.9 0.051 0.98 
na.spline 0.423 1.817 74.8 0.69 0.032 | 0.34 3.42 1005.6 3.035 0.18 
Random 12.5% 0.866 0.122 6.0 0.012 0.036 | 0.82 0.21 61.7 0.031 0.97 
Kalman 0.85 0.143 6.8 0.026 0.037 | 0.82 0.21 66.8 0.153 0.97 
DTWBI 0.867 0.122 6.0 0.013 0.019 | 0.9 0.11 35.9 0.023 0.991 
eDTWBI 0.874 0.115 5.8 0.009 0.019 | 0.91 0.09 31.7 0.02 0.998 
na.interp 0.821 0.179 8.3 0.048 0.037 | 0.82 0.21 66.8 0.153 0.97 
na.locf 0.786 0.221 9.5 0.086 0.035 | 0.82 0.22 67.9 0.159 0.96 
Aggregate 0.873 0.116 5.8 0.009 0.036 | 0.84 0.18 54.3 0.027 0.98 
na.spline 0.39 2.103 87.0 1.833 0.04 0.28 5.38 1625.1 2.045 0.13 
Random 15% 0.859 0.135 6.2 0.019 0.042 | 0.84 0.18 53.9 0.013 0.98 
Kalman 0.871 0.125 6.0 0.017 0.039 | 0.82 0.22 68.2 0.157 0.96 
DTWBI 0.863 0.133 6.4 0.026 0.023 | 0.91 0.10 33.1 0.018 0.99 
eDTWBI 0.87 0.123 5.9 0.02 0.023 | 0.913 0.09 30.4 0.018 0.999 
na.interp 0.865 0.133 6.3 0.02 0.039 | 0.82 0.22 68.2 0.157 0.96 
na.locf 0.854 0.148 7.0 0.041 0.039 | 0.82 0.21 66.4 0.148 0.97 
Aggregate 0.867 0.125 5.9 0.018 0.039 | 0.84 0.18 53.9 0.01 0.98 
na.spline 0.314 2.674 111.0 1.816 0.04 0.23 6.55 1901.0 6.482 0.08 


looking at the results in Table 3, the eDTWBI method proves its superior ability compared to the other methods for the task of completing missing data on large datasets. It provides the highest Similarity and FA2 and the lowest NMAE and RMSE at every missing level.
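The FA2 index used for the Cua Ong series can be sketched as below. This is a minimal illustration that assumes the usual factor-of-two reading of the index (the ratio of imputed to true value lies in [0.5, 2]) and true values that are nonzero; the function name is hypothetical.

```python
import numpy as np

def fa2(y, x):
    """Fraction of points whose ratio y_i / x_i lies in [0.5, 2]."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    ratio = y / x                        # assumes no true value is exactly zero
    return float(np.mean((ratio >= 0.5) & (ratio <= 2.0)))
```

For true values (1, 2, 4, 10) and imputed values (1, 5, 2, 10), the ratios are 1, 2.5, 0.5 and 1, so three of the four points fall inside the band and FA2 = 0.75.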

Besides, the visual performance of the imputed values generated by the different methods is studied. Figure 2 presents the shape of the imputed data produced by 7 different methods on the CO2 series. DTWBI respects the shape of the real values well.
Fig. 2. Visual comparison of imputed values of different univariate methods with true values on CO2 series with the gap size of 20 (12.5%) at position 89.


However, compared with DTWBI, eDTWBI again proves its ability to deal with missing data: the dynamics of the values predicted by eDTWBI are almost identical to the form of the true values.

Although the Kalman method has the best FB value on the CO2 and Phu Lien humidity series at the 12.5% missing ratio (Tables 2 and 3), Fig. 2 clearly shows that the amplitude and shape of the values imputed by Kalman differ greatly from the actual values.

Figure 2 also shows that three methods, namely na.aggregate, na.locf and na.interp, always produce a straight line even though their quantitative indicators are quite good (Table 2). This means that they do not respect the shape of the true values. Random is a heuristic method, but it provides better results than Kalman or spline in most cases on the Phu Lien series (Table 2).


4 Conclusions and Future Work 


This paper proposes a new method, namely eDTWBI, for imputing missing data in univariate time series. The eDTWBI method extends DTWBI by searching for similar values in the data both before and after each gap. It is evaluated and compared with seven other methods on 4 datasets using different criteria (amplitude, frequency and shape constraints). The obtained results clearly demonstrate that our method provides improved performance. However, eDTWBI is based on an assumption of recurring data. In the future, we intend to combine eDTWBI with other algorithms, such as interpolation methods, to effectively complete missing data in every type of univariate time series.
Abstract. In this work, we consider Robust Principal Component Analysis, a popular method of dimensionality reduction. The corresponding optimization problem involves the minimization of the $\ell_0$-norm, which is known to be NP-hard. To deal with this problem, we replace the $\ell_0$-norm by a non-convex approximation, namely the capped $\ell_1$-norm. The resulting optimization problem is non-convex, and we develop a reweighted-$\ell_1$ based algorithm to solve it. Numerical experiments on several synthetic datasets illustrate the efficiency of our algorithm and its superiority compared to several state-of-the-art algorithms.


Keywords: Robust principal component analysis · Sparse optimization · Non-convex optimization · Reweighted-$\ell_1$


1 Introduction 


Principal Component Analysis (PCA) is certainly one of the most popular methods of dimensionality reduction, with a very wide range of applications such as data visualization, image compression, bio-informatics, etc. PCA tries to interpret the data by assuming that it lies in a space of lower dimension. Formally, let $M \in \mathbb{R}^{m \times n}$ be the data matrix, each column of which is a data point. One assumes that M can be approximated by the sum of two matrices L + S, where L is a rank-r matrix with $r \ll \min(m, n)$ and S represents a small noisy perturbation of each data point of L. Thus, PCA, which consists in seeking the best rank-r matrix L, can be formulated as follows

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|M - L\|_F \quad \text{s.t.} \quad M = L + S, \quad \mathrm{rank}(L) \le r, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm. The above optimization problem can be efficiently solved using the singular value decomposition (SVD). Moreover, under
the assumption that the data matrix M is corrupted by a small Gaussian noise S, PCA enjoys several optimality properties [10]. However, it is well known that the biggest drawback of PCA is its high sensitivity to outliers: even a single grossly corrupted entry in the data may break down PCA [12]. Unfortunately, in many machine learning applications, data often contains gross errors. Numerous methods, such as multivariate trimming, alternating minimization and random sampling techniques, have been developed in order to improve the robustness of PCA [1]. Robust PCA (RPCA) is arguably one of the most efficient approaches for this purpose. Unlike PCA, RPCA considers that the noise matrix S is sparse and can contain highly corrupted measurements. RPCA aims to recover the data matrix M with the lowest-rank matrix L and the sparsest error matrix S. Thus, RPCA can be formulated as:


$$\min_{L, S \in \mathbb{R}^{m \times n}} \mathrm{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad M = L + S, \qquad (2)$$


where $\lambda > 0$ is the trade-off parameter between the two terms. Moreover, let $p = \min(m, n)$ and let $\sigma(L) = (\sigma_1(L), \ldots, \sigma_p(L))$ be the singular values of L in descending order. It is obvious that $\mathrm{rank}(L) = \|\sigma(L)\|_0$. Therefore, problem (2) can be equivalently expressed in the form


$$\min_{L, S \in \mathbb{R}^{m \times n}} \|\sigma(L)\|_0 + \lambda \|S\|_0 \quad \text{s.t.} \quad M = L + S. \qquad (3)$$


Hence, the RPCA problem involves the minimization of $\|\cdot\|_0$, which is known to be NP-hard. Sparse optimization plays a very important role, especially in machine learning, and numerous methods have been developed for it. Readers are referred to [4] for an extensive survey of existing methods for sparse optimization. Among the methods for RPCA, most recent research focuses on solving an approximate problem of RPCA obtained by replacing the non-convex functions rank(L) (i.e., $\|\sigma(L)\|_0$) and $\|S\|_0$ by convex functions. In [1], Candès et al. proposed the so-called Principal Component Pursuit (PCP), defined as


$$\min_{L, S \in \mathbb{R}^{m \times n}} \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad M = L + S, \qquad (4)$$


where rank(L) is approximated by the nuclear norm $\|L\|_* = \|\sigma(L)\|_1 = \sum_{i=1}^{p} \sigma_i(L)$ and $\|S\|_0$ is replaced by its convex approximation $\|S\|_1$. Thus, the PCP problem is a convex problem for which tractable convex optimization techniques can be investigated. Candès et al. [1] developed an Alternating Direction Method of Multipliers (ADMM) for solving (4). Furthermore, the authors proved that, under some weak assumptions, solving (4) exactly recovers the low-rank matrix L and the sparse error matrix S. Different methods such as the Inexact Augmented Lagrangian method (Inexact ALM) [6], the Alternating Direction Method [6], etc. were also
developed for solving (4). In [10], Wright et al. considered a variant of PCP, namely Stable RPCA, where the constraint $M = L + S$ is replaced by $\|M - L - S\|_F^2 \le \epsilon$. Hence (4) becomes


$$\min_{L, S \in \mathbb{R}^{m \times n}} \|L\|_* + \lambda \|S\|_1 + \frac{\mu}{2} \|M - L - S\|_F^2. \qquad (5)$$


Wright et al. [10] proved that as $\mu \searrow 0$, the solutions of (5) approach the solution set of (4).

Although PCP, Stable RPCA and their variants improve the robustness of PCA in the case of a grossly corrupted error matrix S, they still show limitations. The use of convex approximations to replace the non-convex terms rank(L) and $\|S\|_0$ in RPCA makes the resulting problem convex, and thus easier to solve. However, it is not enough to simply use the $\ell_1$-norm or $\ell_2$-norm to model the noise, since real noise often has a more complex structure than simple Gaussian or Laplacian error [11]. In [9], for the first time, Sun et al. used a non-convex function, namely the capped $\ell_1$-norm, to replace the $\ell_0$-norm. The capped $\ell_1$ function is then decomposed as a DC (Difference of two Convex) function, and Sun et al. developed a DCA (DC Algorithm) to solve the resulting problem. DCA is well known as an efficient approach in the nonconvex programming framework thanks to its versatility, flexibility, robustness, inexpensiveness and adaptation to the specific structure of the considered problems [3,7,8]. DCA solves a sequence of convex problems instead of solving the considered non-convex problem directly. In [9], the proposed DCA requires solving convex sub-problems, for which an Alternating Direction Method of Multipliers (ADMM) was developed. Unfortunately, ADMM is quite slow at solving these convex sub-problems.

Convinced by the necessity and efficiency of non-convex approximations of the $\ell_0$-norm, in this work we also consider a non-convex approximation. More precisely, we deal with the following formulation of RPCA:

$$\min_{L, S \in \mathbb{R}^{m \times n}} \|\sigma(L)\|_0 + \lambda \|S\|_0 + \frac{\mu}{2} \|M - L - S\|_F^2. \qquad (6)$$

The remainder of the paper is organized as follows. In Sect. 2, we reformulate the RPCA formulation (6) using the capped $\ell_1$-norm and then develop a reweighted-$\ell_1$ based algorithm to solve the resulting problem. Computational experiments are reported in Sect. 3, and Sect. 4 concludes the paper.


2 Reweighted-$\ell_1$ Based Algorithm for RPCA


Conventionally, to deal with the minimization of the $\ell_0$-norm, we replace the discontinuous $\ell_0$-norm with an appropriate continuous approximation. The approximate problem is then more amenable to optimization methods and, in many cases, we can obtain an equivalent reformulation [4]. Following the successes of non-convex approximations in previous works (see [4] and references therein), we approximate the $\ell_0$-norm in (6) by the capped-$\ell_1$ function, which is defined as follows
$$\varphi_\theta(x) = \theta \min\left\{|x|, \frac{1}{\theta}\right\}, \quad \forall x \in \mathbb{R},$$

where $\theta > 0$ is the approximation parameter. Specifically, for $\theta > 0$ and $x \in \mathbb{R}^n$, we have

$$\|x\|_0 \approx \sum_{i=1}^{n} \varphi_\theta(x_i).$$
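A small numerical check makes this approximation concrete. The sketch below evaluates the capped function $\theta \min\{|x|, 1/\theta\}$ on a short vector for a moderate and a very large $\theta$; the function name `capped_l1` and the test vector are illustrative choices.

```python
import numpy as np

def capped_l1(x, theta):
    """Capped-l1 function theta * min(|x|, 1/theta), applied elementwise."""
    return theta * np.minimum(np.abs(x), 1.0 / theta)

x = np.array([0.0, 0.001, 0.5, -3.0])
approx_small_theta = capped_l1(x, 10.0).sum()   # small entries count only partially
approx_large_theta = capped_l1(x, 1e6).sum()    # every nonzero entry counts as about 1
exact_l0 = np.count_nonzero(x)
```

As $\theta$ grows, each nonzero entry contributes a value closer to 1 while zero entries contribute nothing, so the sum approaches the number of nonzero entries.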


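Hence, for $\theta_1 > 0$ and $\theta_2 > 0$, we consider the following approximation of problem (6), which will be called the capped principal component analysis (CaPCA) problem,

$$\min_{L, S \in \mathbb{R}^{m \times n}} \sum_{i=1}^{p} \varphi_{\theta_1}(\sigma_i(L)) + \lambda \sum_{i=1}^{m} \sum_{j=1}^{n} \varphi_{\theta_2}(S_{ij}) + \frac{\mu}{2} \|M - L - S\|_F^2. \qquad (7)$$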


In the next sections, we present the reweighted-$\ell_1$ procedure for sparse optimization and then adapt it to solve the CaPCA problem.


2.1 Reweighted-$\ell_1$ for Sparse Optimization


In this section, we outline a reweighted-$\ell_1$ procedure for solving sparse optimization problems. More details on the subject can be found in [4] (Sect. 6.2 and line Tcap of Table 3).

Consider a sparse optimization problem of the form

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^{n} \varphi_\theta(x_i) + f(x),$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a finite convex function.

The idea of reweighted-$\ell_1$ for iteratively solving the above problem is as follows. At each iteration k, we replace the nonconvex term $\sum_{i=1}^{n} \varphi_\theta(x_i)$ with the weighted $\ell_1$ term $\sum_{i=1}^{n} z_i^k |x_i|$, where $x^k$ is the current iterate and the weights $z_i^k$ are determined by $z_i^k = \theta$ if $|x_i^k| \le 1/\theta$ and $z_i^k = 0$ otherwise. Then the next iterate $x^{k+1}$ is a solution of the subproblem

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^{n} z_i^k |x_i| + f(x).$$

The above two steps are repeated until convergence.
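The two steps can be illustrated on a toy problem. The sketch below assumes the simple choice $f(x) = \frac{\mu}{2}\|x - b\|^2$, which is not taken from the paper but makes the weighted subproblem solvable in closed form by elementwise soft-thresholding; the function names, the zero starting point and the iteration cap are likewise assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: sgn(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def reweighted_l1_denoise(b, theta, mu=1.0, max_iter=100):
    """Reweighted-l1 iterations for min_x sum_i phi_theta(x_i) + (mu/2)||x - b||^2."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)                       # assumed starting point
    for _ in range(max_iter):
        # Step 1: weights from the current iterate.
        z = np.where(np.abs(x) <= 1.0 / theta, theta, 0.0)
        # Step 2: the weighted-l1 subproblem has the closed form S_{z_i/mu}(b_i).
        x_new = soft_threshold(b, z / mu)
        if np.array_equal(x_new, x):           # fixed point reached
            break
        x = x_new
    return x

x_hat = reweighted_l1_denoise([3.0, 0.1, -2.0, 0.05], theta=4.0, mu=4.0)
```

In this example the first pass shrinks every entry by 1, and the second pass stops penalizing the two entries that remained large, so the small entries are set to zero while the large ones are returned without shrinkage. The limit point generally depends on the starting point because the problem is non-convex.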


2.2 Reweighted-$\ell_1$ for Solving the CaPCA Problem


For solving problem (7), we adapt the reweighted-$\ell_1$ procedure described earlier by regarding the $\sigma_i(L)$'s as variables. At iteration k, given the current iterate $(L^k, S^k)$, we need to solve the following reweighted-$\ell_1$ subproblem to get the next iterate $(L^{k+1}, S^{k+1})$
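$$\min_{L, S \in \mathbb{R}^{m \times n}} \sum_{i=1}^{p} \alpha_i^k \sigma_i(L) + \lambda \sum_{i=1}^{m} \sum_{j=1}^{n} \beta_{ij}^k |S_{ij}| + \frac{\mu}{2} \|M - L - S\|_F^2, \qquad (8)$$

where

$$\alpha_i^k = \begin{cases} \theta_1 & \text{if } \sigma_i(L^k) \le 1/\theta_1, \\ 0 & \text{otherwise,} \end{cases} \qquad \beta_{ij}^k = \begin{cases} \theta_2 & \text{if } |S_{ij}^k| \le 1/\theta_2, \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$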


Problem (8) is generally nonconvex. Fortunately, if either L or S is fixed, the other can be computed explicitly as described below. Thus, we can solve problem (8) by iteratively alternating between L and S. Before proceeding, let $\mathcal{S}_\tau : \mathbb{R} \to \mathbb{R}$ (with $\tau > 0$) denote the soft-thresholding operator

$$\mathcal{S}_\tau(x) = \mathrm{sgn}(x) \max(|x| - \tau, 0) = \max(x - \tau, 0) + \min(x + \tau, 0). \qquad (10)$$


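Computing S with fixed L. Let $A = M - L$; then computing S amounts to solving the problem

$$\min_{S \in \mathbb{R}^{m \times n}} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \lambda \beta_{ij}^k |S_{ij}| + \frac{\mu}{2} (S_{ij} - A_{ij})^2 \right),$$

which has the closed-form solution given by

$$S_{ij} = \mathcal{S}_{\lambda \beta_{ij}^k / \mu}(A_{ij}), \quad \forall i, j. \qquad (11)$$

Computing L with fixed S. With fixed S, problem (8) reduces to

$$\min_{L \in \mathbb{R}^{m \times n}} \sum_{i=1}^{p} \alpha_i^k \sigma_i(L) + \frac{\mu}{2} \|L - B\|_F^2, \qquad (12)$$

where $B = M - S$.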

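Assume that $B = U \Sigma_B V^T$ is any singular value decomposition (SVD) of B and that $(\sigma_1(B), \ldots, \sigma_p(B)) = \mathrm{diag}(\Sigma_B)$ are nonincreasingly ordered. Due to [2, Corollary 7.4.1.3], we have

$$\sum_{i=1}^{p} \alpha_i^k \sigma_i(L) + \frac{\mu}{2} \|L - B\|_F^2 \ge \sum_{i=1}^{p} \left( \alpha_i^k \sigma_i(L) + \frac{\mu}{2} (\sigma_i(L) - \sigma_i(B))^2 \right) \qquad (13)$$

for any $L \in \mathbb{R}^{m \times n}$.

It is easy to show that minimizing the right-hand side of (13) in $\sigma_i = \sigma_i(L)$ with constraints $\sigma_i \ge 0$, $\forall i$, gives the solution $\bar{\sigma} = (\bar{\sigma}_1, \ldots, \bar{\sigma}_p)$, where

$$\bar{\sigma}_i = \max\left( \sigma_i(B) - \frac{\alpha_i^k}{\mu}, 0 \right) = \mathcal{S}_{\alpha_i^k / \mu}(\sigma_i(B)), \quad \forall i = 1, \ldots, p. \qquad (14)$$

By letting $L = U \mathrm{diag}(\bar{\sigma}) V^T$, we see that equality holds in (13), while the right-hand side of (13) is minimized. Thus, L is the optimal solution of problem (12).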
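We summarize the proposed algorithm for solving problem (7) below.


Algorithm 1. CaPCA


Require: $M \in \mathbb{R}^{m \times n}$, $\lambda > 0$, $\mu > 0$, $\theta_1 > 0$, $\theta_2 > 0$, $\epsilon_1 > 0$, $\epsilon_2 > 0$
$L^0 = S^0 = 0$, $k = 0$, $p = \min(m, n)$
repeat
    Compute weights $\alpha^k \in \mathbb{R}^p$ and $\beta^k \in \mathbb{R}^{m \times n}$ using (9)
    Set $t = 0$, $L^{k,0} = L^k$, $S^{k,0} = S^k$
    repeat
        Calculate $S^{k,t+1}_{ij} = \mathcal{S}_{\lambda \beta^k_{ij} / \mu}(M_{ij} - L^{k,t}_{ij})$ for all i, j
        Compute an SVD $M - S^{k,t+1} = U \Sigma V^T$ and $(\sigma_1, \ldots, \sigma_p) = \mathrm{diag}(\Sigma)$
        Calculate $\bar{\sigma}_i = \max(\sigma_i - \alpha^k_i / \mu, 0)$ for any $i = 1, \ldots, p$
        Let $L^{k,t+1} = U \mathrm{diag}(\bar{\sigma}_1, \ldots, \bar{\sigma}_p) V^T$
        Set $t = t + 1$
    until $\|\sigma(L^{k,t}) - \sigma(L^{k,t-1})\| / (1 + \|\sigma(L^{k,t})\|) \le \epsilon_1$
    Set $L^{k+1} = L^{k,t}$, $S^{k+1} = S^{k,t}$
    Set $k = k + 1$
until $\|\sigma(L^k) - \sigma(L^{k-1})\| / (1 + \|\sigma(L^k)\|) \le \epsilon_2$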
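Algorithm 1 can be sketched in NumPy as below. This is an illustrative reading of the steps above, not the authors' MATLAB code: the function names, the default tolerances and the iteration caps `max_outer` and `max_inner` are assumptions added here, and the objective of (7) is recorded after each outer iteration so that its behaviour can be inspected.

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft-thresholding operator (10), elementwise; tau may be an array."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def capped_l1(x, theta):
    """Capped-l1 function theta * min(|x|, 1/theta), elementwise."""
    return theta * np.minimum(np.abs(x), 1.0 / theta)

def objective(M, L, S, lam, mu, theta1, theta2):
    """Objective of problem (7) evaluated at (L, S)."""
    sigma = np.linalg.svd(L, compute_uv=False)
    fit = 0.5 * mu * np.linalg.norm(M - L - S, "fro") ** 2
    return capped_l1(sigma, theta1).sum() + lam * capped_l1(S, theta2).sum() + fit

def capca(M, lam, mu, theta1, theta2, eps1=1e-6, eps2=1e-6,
          max_outer=50, max_inner=100):
    """Sketch of Algorithm 1; max_outer and max_inner are safety caps (assumed)."""
    m, n = M.shape
    L, S = np.zeros((m, n)), np.zeros((m, n))
    sigma = np.zeros(min(m, n))                # singular values of the current L
    history = [objective(M, L, S, lam, mu, theta1, theta2)]
    for _ in range(max_outer):
        # Weights (9), computed from the current outer iterate.
        alpha = np.where(sigma <= 1.0 / theta1, theta1, 0.0)
        beta = np.where(np.abs(S) <= 1.0 / theta2, theta2, 0.0)
        sigma_outer = sigma
        for _ in range(max_inner):
            sigma_prev = sigma
            # S-step (11): elementwise soft-thresholding of M - L.
            S = soft_threshold(M - L, lam * beta / mu)
            # L-step (14): shrink the singular values of M - S.
            U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
            sigma = np.maximum(s - alpha / mu, 0.0)
            L = (U * sigma) @ Vt
            if np.linalg.norm(sigma - sigma_prev) <= eps1 * (1.0 + np.linalg.norm(sigma)):
                break
        history.append(objective(M, L, S, lam, mu, theta1, theta2))
        if np.linalg.norm(sigma - sigma_outer) <= eps2 * (1.0 + np.linalg.norm(sigma)):
            break
    return L, S, history
```

A call such as `L, S, history = capca(M, lam=0.1, mu=1.0, theta1=1.0, theta2=5.0)` returns the two estimated components together with the recorded objective values; the parameter values in this call are placeholders, not the settings used in the experiments.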


The following result, whose proof is omitted due to the paper's limited length, establishes the validity of Algorithm 1.


Theorem 1. Let F(L, S) denote the objective function in (7). Suppose that $(L^{k-1}, S^{k-1})$ and $(L^k, S^k)$ with $k \ge 1$ are two consecutive iterates of Algorithm 1. Then we have

$$F(L^k, S^k) \le F(L^{k-1}, S^{k-1}).$$

3 Numerical Experiments 


Dataset. For each quadruple $(m, n, r, c_p)$, where $r \ll \min(m, n)$ is the dimension of the target space and $c_p \in (0, 1)$ represents the sparsity ratio of the error matrix, the data matrix $M \in \mathbb{R}^{m \times n}$ is generated as follows. We first generate the rank-r matrix $L_0 = U V^T$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ have entries independently and identically distributed (i.i.d.) according to the standard Gaussian distribution. We generate the sparse matrix $S_0 \in \mathbb{R}^{m \times n}$ such that $\|S_0\|_0 \approx c_p m n$ and its nonzero entries are i.i.d. from the Rademacher distribution. Finally, the data matrix M is computed as $M = L_0 + S_0$. In the result tables, we explicitly indicate the value of $\|S_0\|_0$.
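The generation procedure can be sketched as follows. The paper only states that the number of nonzero entries is approximately $c_p m n$; drawing the support with an independent Bernoulli mask is one way to obtain this and is an assumption here, as are the function name and the seed argument.

```python
import numpy as np

def make_rpca_instance(m, n, r, c_p, seed=0):
    """Synthetic RPCA data M = L0 + S0 with rank-r L0 and sparse Rademacher S0."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, r))            # i.i.d. standard Gaussian factors
    V = rng.standard_normal((n, r))
    L0 = U @ V.T                               # rank-r matrix (with probability 1)
    mask = rng.random((m, n)) < c_p            # assumed: Bernoulli(c_p) support
    signs = rng.choice([-1.0, 1.0], size=(m, n))
    S0 = np.where(mask, signs, 0.0)            # nonzero entries are +1 or -1
    return L0 + S0, L0, S0
```

With this construction the count of nonzero entries fluctuates around $c_p m n$ from one draw to the next, which is in line with the tables reporting $\|S_0\|_0$ explicitly for each instance.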
Comparative Algorithms. We compare our algorithm with the following 4 algorithms:


- DCA-ADMM [9] is a DCA based algorithm for solving the following formulation of RPCA

$$\min_{L, S \in \mathbb{R}^{m \times n}} \mathrm{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad \|M - L - S\|_F \le \epsilon. \qquad (15)$$

  Recall that in [9], Sun et al. replaced $\|\cdot\|_0$ by the capped $\ell_1$-norm. In the same work, the authors also developed AL, an alternating algorithm to solve (15).
- EALM and IALM are two versions of augmented Lagrange multiplier (ALM) based algorithms, presented in [5], for solving the PCP formulation (4). EALM stands for Exact ALM, while IALM is the inexact version of EALM.


Comparative Criteria. To evaluate the performance of the algorithms, we consider the following criteria ((L, S) is the solution obtained by each comparative algorithm):


- the rank of the matrix L,
- $\|S\|_0$, the number of nonzero components of S,
- the relative error between L and $L_0$, computed by $\|L - L_0\|_F / \|L_0\|_F$,
- the relative error between S and $S_0$, defined as $\|S - S_0\|_F / \|S_0\|_F$,
- computation time.
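The first four criteria can be computed with a few NumPy calls, as in the sketch below. The function name and the returned dictionary keys are illustrative; the rank uses NumPy's default numerical tolerance and the nonzero count is exact, which may differ from the thresholds used in the original experiments.

```python
import numpy as np

def rpca_criteria(L, S, L0, S0):
    """Rank of L, number of nonzeros of S, and relative errors to (L0, S0)."""
    return {
        "rank": int(np.linalg.matrix_rank(L)),
        "nnz": int(np.count_nonzero(S)),
        "err_L": float(np.linalg.norm(L - L0, "fro") / np.linalg.norm(L0, "fro")),
        "err_S": float(np.linalg.norm(S - S0, "fro") / np.linalg.norm(S0, "fro")),
    }
```

Computation time would be measured separately around the call to each algorithm.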


The experiments are performed on an Intel Core i5 3.60 GHz PC with 8 GB of RAM, and the codes were written in MATLAB. The CPU time limit for the algorithms is set to 3 h.

In Table 1, we report the comparative results with m = n = 500, r = 25 and different values of the sparsity ratio $c_p$. We observe that, in all 4 settings, our algorithm CaPCA recovers the true rank of the matrix $L_0$, while only IALM successfully finds the true value of $\mathrm{rank}(L_0)$ for $c_p = 0.25$. When $c_p$ increases, i.e., when the error matrix $S_0$ is less sparse, the other comparative algorithms (DCA-ADMM, AL, EALM and IALM) perform badly. Overall, CaPCA gives better results than the other 4 algorithms in all comparative criteria. Especially in terms of CPU time, CaPCA is by far the fastest algorithm compared to DCA-ADMM, AL and EALM.

In Tables 2 and 3, we present the comparative results with (m, n, r) = (1000, 1000, 50) and (m, n, r) = (2000, 2000, 100). Similarly to the results in Table 1, we can see that our algorithm CaPCA outperforms the other 4 algorithms.
Table 1. m = n = 500, r = rank(L0) = 25

||S0||_0 (c_p)       | Algorithm | rank(L) | ||S||_0 | ||L-L0||_F/||L0||_F | ||S-S0||_F/||S0||_F | Time(s)
62646 (c_p = 0.25)   | DCA-ADMM  | 46      | 62454   | 2.4e-6 | 1.8e-5 | 48.2
                     | AL        | 25      | 63234   | 3.2e-6 | 1.2e-5 | 20.3
                     | EALM      | 58      | 64234   | 1.2e-6 | 1.4e-5 | 14.1
                     | IALM      | 25      | 62646   | 1.0e-6 | 1.4e-5 | 4.7
                     | CaPCA     | 25      | 62646   | 6.4e-6 | 2.8e-5 | 1.0
75195 (c_p = 0.30)   | DCA-ADMM  | 287     | 79512   | 2.6e-6 | 2.5e-5 | 209
                     | AL        | 55      | 89421   | 2.6e-6 | 2.5e-5 | 13
                     | EALM      | 113     | 82442   | 2.5e-6 | 2.4e-5 | 60.2
                     | IALM      | 299     | 144055  | 2.9e-3 | 2.6e-2 | 4.9
                     | CaPCA     | 25      | 75195   | 1.8e-6 | 2.4e-5 | 3.9
87605 (c_p = 0.35)   | DCA-ADMM  | 134     | 119973  | 2.1e-4 | 2.4e-4 | 1273
                     | AL        | 98      | 104936  | 2.2e-4 | 2.8e-4 | 176
                     | EALM      | 227     | 129976  | 1.8e-5 | 1.5e-4 | 203.0
                     | IALM      | 297     | 150091  | 2.3e-2 | 1.9e-1 | 3.7
                     | CaPCA     | 25      | 87605   | 1.8e-5 | 1.2e-4 | 1.7
100247 (c_p = 0.40)  | DCA-ADMM  | 388     | 121716  | 2.1e-2 | 4.2e-2 | 1593
                     | AL        | 124     | 134723  | 7.4e-2 | 3.7e-2 | 265
                     | EALM      | 311     | 164725  | 4.4e-2 | 3.5e-1 | 155
                     | IALM      | 289     | 160162  | 4.7e-2 | 3.7e-1 | 3.6
                     | CaPCA     | 25      | 100252  | 1.1e-3 | 9.5e-3 | 2.9

Table 2. m = n = 1000, r = rank(L0) = 50

||S0||_0 (c_p)       | Algorithm | rank(L) | ||S||_0 | ||L-L0||_F/||L0||_F | ||S-S0||_F/||S0||_F | Time(s)
249920 (c_p = 0.25)  | DCA-ADMM  | 93      | 204382  | 6.5e-7 | 2.9e-5 | 510
                     | AL        | 50      | 256323  | 4.5e-7 | 2.6e-5 | 294
                     | EALM      | 116     | 254487  | 9.6e-7 | 1.7e-5 | 113.8
                     | IALM      | 50      | 249920  | 8.5e-7 | 1.6e-5 | 34.0
                     | CaPCA     | 50      | 249920  | 5.4e-6 | 2.6e-5 | 10.2
299047 (c_p = 0.30)  | DCA-ADMM  | 172     | 3059222 | 3.6e-6 | 2.6e-5 | 1243
                     | AL        | 53      | 315234  | 2.9e-6 | 2.6e-5 | 682
                     | EALM      | 198     | 314445  | 1.3e-6 | 2.0e-5 | 327.1
                     | IALM      | 59      | 305231  | 1.6e-4 | 2.1e-3 | 105.1
                     | CaPCA     | 50      | 299047  | 2.6e-6 | 2.6e-5 | 12.3
349412 (c_p = 0.35)  | DCA-ADMM  | 392     | 469608  | 5.4e-6 | 3.2e-4 | 5304
                     | AL        | 143     | 489328  | 3.9e-6 | 3.3e-4 | 1983
                     | EALM      | 520     | 568268  | 6.4e-6 | 7.7e-5 | 1045
                     | IALM      | 589     | 596800  | 1.5e-2 | 1.7e-1 | 38.8
                     | CaPCA     | 50      | 349412  | 7.9e-6 | 7.7e-5 | 15.8
399835 (c_p = 0.40)  | DCA-ADMM  | 763     | 469835  | 4.3e-4 | 3.5e-2 | 10932
                     | AL        | 211     | 499336  | 4.3e-4 | 2.3e-2 | 2421
                     | EALM      | 570     | 597514  | 3.1e-2 | 3.5e-1 | 1376.4
                     | IALM      | 575     | 647112  | 3.3e-2 | 3.7e-1 | 37.4
                     | CaPCA     | 50      | 399835  | 4.4e-5 | 4.2e-4 | 21.5

Table 3. m = n = 2000, r = rank(L0) = 100

||S0||_0 (c_p)        | Algorithm | rank(L) | ||S||_0 | ||L-L0||_F/||L0||_F | ||S-S0||_F/||S0||_F | Time(s)
1001120 (c_p = 0.25)  | DCA-ADMM  | 539     | 1051425 | 2.5e-06 | 2.5e-05 | 11800
                      | AL        | 100     | 1001120 | 6.1e-07 | 1.6e-05 | 2532
                      | EALM      | 232     | 1017204 | 2.8e-07 | 7e-06   | 845
                      | IALM      | 100     | 1001120 | 2.1e-07 | 1.9e-05 | 245
                      | CaPCA     | 100     | 1001120 | 7e-07   | 1.5e-05 | 81
1201715 (c_p = 0.3)   | DCA-ADMM  | 314     | 1213012 | 4.3e-6  | 2.7e-5  | 11800
                      | AL        | 100     | 1213035 | 1.2e-6  | 2.5e-5  | 11800
                      | EALM      | 406     | 1262762 | 1.3e-6  | 2.7e-5  | 2337
                      | IALM      | 101     | 1203035 | 1.3e-6  | 2.7e-5  | 301
                      | CaPCA     | 100     | 1201715 | 4.7e-6  | 3.6e-5  | 97
1399371 (c_p = 0.35)  | DCA-ADMM  | 1983    | 2438876 | 3.2e-4  | 3.6e-3  | 11800
                      | AL        | 541     | 1569374 | 7.2e-5  | 9.5e-5  | 11800
                      | EALM      | 1101    | 2639010 | 5.6e-6  | 9.5e-5  | 10174
                      | IALM      | 1175    | 2378633 | 0.01    | 0.17    | 303
                      | CaPCA     | 100     | 1399371 | 5.4e-6  | 7.4e-5  | 11
1601671 (c_p = 0.4)   | DCA-ADMM  | 1987    | 2894311 | 0.09    | 1.21    | 11800
                      | AL        | 1784    | 2874512 | 0.09    | 0.98    | 11800
                      | EALM      | 1163    | 2434515 | 0.02    | 0.37    | 10438
                      | IALM      | 1151    | 2599415 | 0.02    | 0.37    | 321
                      | CaPCA     | 100     | 1601671 | 1e-5    | 1e-4    | 154


4 Conclusion 


We have studied Principal Component Analysis (PCA), one of the most popular
methods of dimensionality reduction. In order to deal with highly corrupted
noisy data, we have considered a variant of PCA, namely Robust PCA (RPCA).
RPCA can be formulated as an optimization problem that involves the
minimization of the $\ell_0$-norm. The discontinuity of the $\ell_0$-norm makes the
corresponding problem hard to solve. Thus, we approximate the $\ell_0$-norm by a
continuous non-convex function, namely the capped $\ell_1$ norm. The resulting
problem is non-convex, and we developed a reweighted-$\ell_1$ based algorithm for it.
The proposed algorithm iteratively solves a non-convex sub-problem in the two
variables $L$ and $S$. Fortunately, if either $L$ or $S$ is fixed, the other can be
computed explicitly.
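
To illustrate the approximation idea, the following minimal sketch compares the exact count of nonzero entries with one common form of the capped function, namely the sum of min(1, theta * |t|) over the entries. The exact form used by the algorithm and the parameter name `theta` are assumptions made for illustration only; they are not taken from the text above.

```python
def l0_norm(values):
    """Exact count of nonzero entries (discontinuous in each entry)."""
    return sum(1 for v in values if v != 0)


def capped_l1(values, theta):
    """Capped l1 surrogate: sum of min(1, theta * |v|) (assumed form)."""
    return sum(min(1.0, theta * abs(v)) for v in values)


# Hypothetical sparse vector with three nonzero entries.
s = [0.0, 0.5, -2.0, 0.0, 3.0]
exact = l0_norm(s)
# The surrogate is continuous and approaches the count as theta grows.
approx = [capped_l1(s, theta) for theta in (1.0, 10.0, 100.0)]
```

With a small `theta` the surrogate underestimates the count for small entries; as `theta` grows, every nonzero entry saturates at 1 and the surrogate matches the count.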

We have carefully conducted numerical experiments on several synthetic
datasets. The numerical results showed that our algorithm CaPCA recovers
exactly the rank of the original matrix $L_0$ in all instances. Moreover, the number
of nonzero components of $S$ given by our algorithm is very close to that
of the original matrix $S_0$. Overall, CaPCA outperforms several state-of-the-art
algorithms for RPCA with respect to all comparative criteria.

Abstract. Lifelong machine learning has recently become a hot topic, attracting
researchers all over the world because of its effectiveness in dealing with a current
problem by exploiting past knowledge. The combination of topic modeling
on previous domain knowledge (such as topic modeling with Automatically
generated Must-links and Cannot-links, which exploits must-links and cannot-links
between two terms) and lifelong topic modeling (which employs the modeling of
previous tasks) is widely used to produce better topics. This paper proposes a
probability-based close domain metric to choose valuable knowledge learnt
from the past in order to produce more associated topics on the current domain. This
knowledge is then used to enrich the features of a multi-label classifier. Several
experiments performed on a dataset of hotel reviews show that the proposed
approach improves performance over the baseline.


Keywords: Close domain - Lifelong learning - Multi-label classification - 
Lifelong topic modeling 


1 Introduction 


Latent Dirichlet Allocation (LDA) [1, 2] and Probabilistic Latent Semantic Analysis
(pLSA) [12] are two popular topic models for discovering the hidden topics in a
text corpus following an unsupervised learning approach. In general, topic modeling
assumes that, probabilistically, each document is a multinomial distribution over a
fixed number of topics, and each topic is a multinomial distribution over all the
observed words in the corpus. In this way, the relationship between the document-topic
distribution and the topic-word distribution is defined. These models typically need an
enormous amount of data (thousands of documents) to obtain meaningful statistics
for extracting good-quality topics.

Recently, lifelong topic modeling (LTM), a type of lifelong unsupervised learning,
has been widely used to exploit prior domain knowledge during model
inference to produce more sensible and reasonable topics [4-8]. Existing work
exploits the previous knowledge in two forms: must-links (i.e., two words that
often appear in the same topic) and cannot-links (i.e., two words that rarely appear in
the same topic), taken from the topics previously generated in several domains (even a
very large number of domains, i.e., big data), to generate better topics in the current
domain. This knowledge-based learning approach imitates the way humans learn, i.e.,
storing the learnt knowledge and using it for inference in the future.


© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 143-149, 2020.
https://doi.org/10.1007/978-3-030-38364-0_13

In [3], instead of collecting all the data from previous domains, the authors assume that
the data from close domains may exploit the document-topic and topic-word
distributions more effectively thanks to the topic overlap between the domains. In
addition, the close domains may share the same features at different levels, which are
selected to improve the current topic learning. The LTM method proposed in [3] is a
learning bias approach at the domain level that identifies the close domains based on
their vocabularies, top words, and topics. By focusing on exploiting prior knowledge of
the close domains for enriching the features of the classifier, this approach showed
little improvement in comparison with traditional topic modeling with Automatically
generated Must-links and Cannot-links (AMC) [5].

In many approaches, a distance is used to determine whether two objects are close
or not. In this paper, we offer an approach to find close domains by exploiting the
relationship between labels and features, i.e., the probability of labeling documents based
on the features of labeled domains. This is because each label may be
represented by specific features, and certain features in many representations may help to
explore the already known concepts or similar ones.

This paper has two main contributions: (i) proposing close domain measures
based on probability features; and (ii) applying the proposed approach to multi-label
classification.

The remainder of this paper is organized as follows. The next section gives some
definitions for identifying close domains based on probability distributions.
Section 3 describes a multi-label classification framework using the proposed lifelong
topic model. Our experiments and the results are discussed in
Sect. 4. We present the differences between our proposed approach and some
related work in Sect. 5. The summary and future work are given in the last section.


2 Definitions 


2.1 Problem Formulation 


Let $D_1, D_2, \ldots, D_N$ be $N$ datasets of $N$ previous tasks. Let $S$ be the knowledge
base, which includes all the knowledge and information from the $N$ previous tasks. $S$ is
empty when $N = 0$. Let $D_{N+1}$ be the dataset of the current task $T_{N+1}$. The goal is to
determine a set of previous datasets $D_{close}$, which includes the previous datasets $D_i$ that
are close to $D_{N+1}$, and then to use the part of the knowledge of $S$ that is related to
$D_{close}$ for solving the current task $T_{N+1}$.

Assume that there exists a general feature space $F$ in which the data from all
domains can be represented. Assume that the sets of values of all features of $F$ are
discrete. Let $sim(x, y)$ be a similarity measure of a pair of data elements $x, y$, e.g., the
cosine measure. Let $X$ be a set of elements, and let $sim(x, X)$ be defined as the maximum
similarity between $x$ and all the elements of $X$, i.e., $sim(x, X) = \max_{y \in X} sim(x, y)$;
this is also called the complete link similarity between two datasets.
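
As a small illustration of the element-to-set similarity defined above, the following sketch uses the cosine measure as the pairwise similarity. The function names `cosine` and `sim_to_set` and the toy vectors are hypothetical and are used only for this example.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)


def sim_to_set(x, X, sim=cosine):
    """sim(x, X): maximum similarity between x and the elements of X."""
    return max(sim(x, y) for y in X)


# Hypothetical feature vectors of a small dataset X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = sim_to_set([2.0, 2.0], X)
```

Here `best` is attained by the element `[1.0, 1.0]`, which points in the same direction as the query vector.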


2.2 The Closeness in Probability of Two Datasets 


Assume that all the data of the $i$-th previous dataset $D_i$ are labelled, and $L_i$ is the label
set. Let $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{i|L_i|})$ be the probability threshold vector of the dataset $D_i$,
$i = 1, 2, \ldots, N$. The probability threshold vector $\theta_i$ is previous knowledge, which is
determined based on the set of posterior probability vectors
$\{(\mathrm{prob}(l_{i1}|x), \mathrm{prob}(l_{i2}|x), \ldots, \mathrm{prob}(l_{i|L_i|}|x)) \mid x \in D_i\}$.

Definition 1. Let $x$ be an element belonging to the current dataset $D_{N+1}$, and let $\theta_i$ be the
probability threshold vector of the previous dataset $D_i$, $i = 1, 2, \ldots, N$. $D_i$ is called close
to $x$ iff at least one posterior probability of $x$ in $D_i$ is greater than or equal to the
predetermined probability threshold $\theta_{ij}$, i.e., $\exists j \in \{1, 2, \ldots, |L_i|\}$ such that
$\mathrm{prob}(l_{ij}|x) \ge \theta_{ij}$.


Definition 2 (closeness between two datasets). The $i$-th previous dataset $D_i$ is called close
in probability to the dataset $D_{N+1}$ iff

$$\frac{|\{x \in D_{N+1} : D_i \text{ is close-prob with } x\}|}{|D_{N+1}|} \ge \theta_{prob},$$

where $\theta_{prob}$ is a predefined threshold for deciding whether two datasets are close to
each other or not.
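
Definitions 1 and 2 can be sketched in a few lines. The sketch below assumes that the posterior probabilities of each current element under the labels of a previous dataset are already available as rows of numbers; how they are estimated is outside this example. The function names, the variable names, and the toy values are hypothetical.

```python
def is_close_to_element(posteriors, thresholds):
    """Definition 1: some label posterior reaches its own threshold."""
    return any(p >= t for p, t in zip(posteriors, thresholds))


def closeness_ratio(posterior_rows, thresholds):
    """Fraction of current elements to which the previous dataset is close."""
    if not posterior_rows:
        return 0.0
    hits = sum(is_close_to_element(row, thresholds) for row in posterior_rows)
    return hits / len(posterior_rows)


def is_close_dataset(posterior_rows, thresholds, theta_prob):
    """Definition 2: the ratio reaches the dataset-level threshold."""
    return closeness_ratio(posterior_rows, thresholds) >= theta_prob


# Hypothetical posteriors of four current elements for two labels of D_i.
rows = [[0.9, 0.1], [0.2, 0.3], [0.4, 0.7], [0.1, 0.1]]
theta_i = [0.8, 0.6]  # per-label thresholds of D_i (illustrative values)
ratio = closeness_ratio(rows, theta_i)
```

In this toy case two of the four elements satisfy Definition 1, so the previous dataset counts as close for any dataset-level threshold up to 0.5.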


3 Proposed Model of Lifelong Topic Modeling Using Close 
Domain Knowledge for Multi-label Classification 


All stages in the multi-label classification framework are illustrated in Fig. 1:


[Figure 1: block diagram with the blocks Training Dataset $D_N$, Testing Dataset $D_{test}$,
Multi-Label Learning, and Multi-Label Classifier]


Fig. 1. A lifelong topic model using prior knowledge from close domains for multi-label
classification framework

• Assume that we already have a knowledge base of several previous domains, and that the
current task includes a training dataset and a testing dataset (in the current domain).

• Firstly, we find the domains close to the current domain using the closeness in
probability approach.

• After that, the close domains are used to train the AMC model to exploit the prior
domain knowledge and to adjust the hidden topic distribution on the current
domain.

• Then, the knowledge of hidden topics derived from the close domains is
transferred into new features for representing the texts. These features are considered better
than those extracted from all previous domains.

• Finally, a multi-label classifier is built for new documents in the current domain
$D_N$. We use different classifiers to examine carefully the improvement in
system performance when applying the proposed approach.


4 Experiments and Discussion 


4.1 The Datasets 


In the experiments, we focus on the impact of different domains on the model,
especially when the current domain has little data (a small training dataset). We used a
multi-label dataset of about 1350 hotel reviews with an associated set of 5 labels:
Place and cost, Services, Equipment, Room standard, and Food. We divided the
original dataset into 5 sub-datasets named $D_1$, $D_2$, $D_3$ (the three previous domains, with
400 reviews each), $D_4$ (the current domain, which is set up with different sizes
of 20, 30, 50, 80, 100 and 150 reviews) and $D_{test}$ (the testing dataset with 100 reviews).


4.2 Experimental Scenarios 
Different configurations were set up in the experiments below:


• The Term Frequency (TF) feature is used for data representation.

• The number of topics for the AMC model is set to 10, 15 and 25.

• For the multi-label classifier, we use the Binary Relevance method with the core
algorithms Naive Bayes, k-Nearest Neighbor (kNN), Gaussian Process, and Random Forest.

• The threshold $\theta_{prob}$ for deciding the close domains is set to 0.1, 0.3 and 0.7. In order to
condense the results, the average of the results of the experiments with the three different
values of $\theta_{prob}$ is calculated.


We evaluate the effectiveness of the system with the popular measures
precision, recall and F1. We perform two groups of experiments with different
settings on the datasets $D_1$, $D_2$, $D_3$, $D_4$ and $D_{test}$ as follows:


• The baseline (denoted by OF): run the multi-label classifiers on the original TF
features without using any previous domain knowledge.

• Our approach: run the multi-label classifiers on the original features together with the
knowledge from the close domains, with different numbers of AMC topics.
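
For reference, precision, recall and F1 for multi-label predictions can be computed as in the minimal sketch below. It uses micro-averaging over all (document, label) pairs, which is an assumption: the averaging scheme used in the experiments is not stated here. The label names come from the dataset description; the predictions are invented for illustration.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over sets of labels."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for three reviews.
truth = [{"Services", "Food"}, {"Equipment"}, {"Place and cost", "Food"}]
pred = [{"Services"}, {"Equipment", "Food"}, {"Food"}]
p, r, f1 = micro_prf(truth, pred)
```

In this toy case there are 3 true positives, 1 false positive and 2 false negatives, giving a precision of 0.75 and a recall of 0.6.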
4.3 Discussions on Results of Experiments


The proposed approach aims to find domains close to the current domain. Then, the
close domains found are fed into the AMC model to build topic modeling features for
enriching the features of the current domain. We show the results of the experiments by
size of $D_4$ and number of topics for the AMC model in Table 1. In comparison with the
baseline, the results with better performance are highlighted in yellow and the best
results are formatted in bold.


Table 1. Results of multi-label text classification using proposed approach in finding close 
domain enriching features for the classifiers 


Se Process a Bayes Random Forest 


D4 earl 
oO 8.70% — oe i 59.33% 86.65%| 7. 14.30% 
29 [20 topics %| 6.49% 55.88%| 37.01% 64.94% 56.72% k 
15 topics | 51.81%] 27.92% 55.06%] _31.82%| 40.33% 62.34%] 58.54% 
2.60%| 5.06%] 65.22% _29.22%| 40.36% ; 62.34% 
57.79%] _57.79%| _$7.79% : 60.84% 20.37% 
30 72.13%] 28.57% 50.38%] 42.86%| 46.32% 
80.60%| 35.06% 55.63%| 54.55%| 55.08% 
91.38%] 34.42% : 56.49% 
35.06%] 50.94%] 66.25%] 34.42% 45.30%] 58. 66.23%] 62.39%] 82.93%] _6. 12.35% 
50 |_20topies | 93.22%] 35.71% : 33.12%| 44.54%| 58.96%] 66.23%] 62.39%] _83.33%| 12.99% 
15 topics | 88.14%] 33.77%] _48. ; ; 58.96%| 66.23%| 62. 71.43%| 6.49% _ 11.90% 
25 topics k f 71.65%] 59. 58.96%] 66.23%] 62. . 5.84%] 10.47% 
57.23%| 59. . 56.83%] 67.53%| 61. ; 3.13% 
go [10 topics E f ; ; . ; 56.83%| _67.53%| 61. i 1.95% 
15 topics r i J ; : 56.83%| _67.53%| _ 61. 86.67% 8.44% 
63.70%| _ 55. 56.83%| 67.53% 71.43% 
53.61%| 57. 55.63% 
100 48.95%] 60.39%| 54.07% 
61.49%| 64.29% 
61.90%] _59.09%| 60.47% 
150 a a 59.09%] 59.09%| 59.09% 
15 topics f 50.65%] 65.55%] 58.43%] 62.99% i 67.53% 
25 topics F 52.60%| 66.94%] 62.50%| 61. F 67.53% = 77% 10.00% 


The results show that the proposed model leads to very promising performance in
many experiments in comparison with the baseline. The classifications using the kNN and
Random Forest algorithms provide higher performance in all groups of experiments
(possibly with different numbers of topics), especially when the current
training dataset ($D_4$) is very small (20 and 30 reviews). In other words, the previous
knowledge from close domains may make a reasonable contribution to text
classification with supervised algorithms.

When the current training dataset ($D_4$) is larger (100 and 150 reviews), the
classification results are not improved, or are even lower than the baseline. This may be
due to the fact that the larger the dataset ($D_4$) is, the more useful features it provides,
and these features may conflict with the features from close domains, which decreases
the performance of the model.
5 Related Works 


In lifelong learning, past knowledge is used to enhance the performance of the
current task. However, in most lifelong learning models, which data domains of the
past knowledge should be used is still an open problem. In many applications, a
model that is effective in one domain may be transferred to another domain (i.e., the
cross-domain learning problem). This approach has many advantages in saving the
time and cost of manually labeling new data. It is clear that the quality of
past knowledge or known domains will have an impact on the current model applied
to a new domain (dataset).

To the best of our knowledge, there has not been much research on selecting the
data domains close to the current dataset in order to enhance learning performance. In
[3], the similarity between two domains is defined based on measures of weighted
word bags and topics (derived from the probability distribution of hidden topics in the
domain) without mining the label set of the training data. The close domains found in [3]
were used to enrich the knowledge base of lifelong topic modeling (LTM) [4], which
improves the hidden topic features of the current task. Our proposed approach exploits
the probability features of both the data and the label set to define the closeness between
two domains. The close domains then become the input of the AMC model to extract
high-quality hidden topic features for the current task.

A probabilistic viewpoint plays an important role in many areas of scientific research.
In a probabilistic model, the relationship among the observables is formed to describe
the fundamental possibility based on their behaviors, even if they are not assumed to hold
exactly for each observation. Therefore, in statistical estimation problems, it is
important to find and evaluate a close probability distribution. In [9, 11], the closeness
between two probability distributions is defined based on an information measure and
the criterion of maximum entropy. In [10], several measures between two probability
distributions were used, including the Hellinger coefficient, the Chernoff coefficient, the
Jeffreys distance, and the J-divergence. In our research, we define the distance between
two different domain datasets via the relationship between an element and a dataset, which
is determined by comparing the posterior probability distribution of the element in the
dataset to a predefined threshold. This extends the definition of closeness
between two probability distributions to represent the relationship between two domain
datasets.


6 Conclusions 


This paper provides a multi-label classification framework that uses a lifelong learning
technique to exploit the previous knowledge of close domains. Its contribution is the
proposal of a close domain measure that determines the
closeness of two different domain datasets based on their probabilistic aspects. This
measure is used to select only the datasets that are deemed close to the current
dataset in order to enhance learning on the current task. The results of our experiments
demonstrate a reasonable improvement of supervised classification using the
proposed approach, especially when only a small labeled dataset is available.

This work will be extended in the future in several ways. Firstly, the threshold
parameters should be chosen according to the features of the datasets themselves instead
of being fixed. Secondly, more experiments should be performed (especially on other
datasets) to evaluate the proposed approach further. Finally, the technique of
mining close domains should be improved in the AMC model to make it more effective.

Abstract. Imbalanced data has attracted considerable attention from researchers, as
shown by the growing number of publications on the topic. The hardest challenge is that
learning algorithms fail to generalize inductive rules, for example because it is
difficult to form a good decision boundary when there are many features but few samples,
and because sampling carries a risk of overfitting. Many solutions have been applied to
deal with these problems. In this article, we propose a novel method called MASI (Moving
to Adaptive Samples in Imbalanced) that changes the labels of majority class samples into
minor class labels based on the data distribution. The proposed method rebalances the
classes before training a model in order to improve classification performance
on imbalanced data. We tested it on several imbalanced datasets from the UCI repository.
The empirical results showed that our method achieves significantly better
Sensitivity and G-mean values than other methods, such as Random
Over-sampling, Random Under-sampling, SMOTE, and Borderline SMOTE, when
using different machine learning approaches, including SVM, C5.0, and RF.


Keywords: Classification performance - Fraud detection - Imbalanced data 


1 Introduction 


In recent years, fraud detection has been one of the most interesting topics. The imbalanced
class problem, a difficult challenge for machine learning, has received considerable
attention in a significant amount of research, especially in the fraud detection domain
[1-6]. It has also been pointed out as one of the 10 most challenging problems in
data mining research [7, 35]. Data sets are said to be balanced if
there are approximately as many positive examples of the concept as there are negative
ones. In contrast, data imbalance occurs when the number of samples differs significantly
between classes, that is, when the number of examples representing
one class is much larger than that of the other classes [8]. The class that holds the
majority of the samples (the major class) is called the negative class, whereas the class
with few examples is called the positive class. Data imbalance is a common issue in
classification and appears in various real-world areas, such as fraud detection, medical
diagnosis, network intrusion detection, and detection of oil spills from radar images of
the ocean surface. In an imbalanced class distribution, the number of samples in the
major class is significantly larger than the number of samples in the minor class [9].
The primary issue with imbalanced classes is that popular classification models are often
biased towards the major class. This leads to accurate classification of the majority
class samples, whereas many samples in the minority class are incorrectly classified.

In some real applications, misclassifying minority class samples leads to serious
consequences. Fraud seriously affects business organizations. Telecommunication
fraud in the United States, for example, costs millions of dollars per year
[10]. In order to detect fraudulent transactions early, people often analyze the
information in the existing transaction database, thereby discovering unusual transactions
early. However, the number of non-fraudulent transactions is significantly higher than
that of fraudulent cases. False prediction of frauds can cause huge economic losses. In
medicine, clinical databases contain a large amount of information about patients and
their pathology. The data mining algorithms used on these databases explore
relationships and patterns in clinical and pathological data to predict the progress and
characteristics of some diseases. From there, it is possible to predict whether a person is
ill or not [10]. However, the number of patients is usually much smaller than the number of
people who are not sick. If a patient is misdiagnosed as not being ill, there will be no
timely treatment. Moreover, combining various classifiers is considered a typical idea to
improve performance. The biggest problem is how to choose the right classifiers from the
myriad of diverse classifiers [4].

When dealing with an imbalanced dataset, we should consider classifiers that adjust
the output threshold instead of using standard machine learning algorithms. This
article presents a new method called MASI, an integrated method that combines the
ADASYN and SPY methods and changes the labels of majority class samples
into minor class labels based on the data distribution. With this approach,
our method rebalances the classes before training a model. As a result, classifiers'
performance on imbalanced data can be improved. This article is organized as follows:
Part 2 reviews related work; the methodology is presented
in Part 3; Part 4 analyzes the experimental results and compares them with other
approaches, including Random Over-sampling (ROS), Random Under-sampling (RUS),
SMOTE, Borderline SMOTE, and SPY; finally, Part 5 concludes.


2 Related Work 


Data imbalance is a common subject in data mining and machine learning.
Approaches such as decision trees, SVM, k-nearest neighbors, and Naive Bayes have been
developed and successfully applied in many areas. However, these approaches face
difficulties with some imbalanced datasets.


2.1 The Imbalanced Data in the Fraud Detection 


In fraud detection, a myriad of datasets are used. Although these datasets differ in
their sizes, the types of fraud and the numbers of fraudulent cases, they have a
common feature: fraudulent cases account for a very small proportion
compared to non-fraudulent cases. In other words, the datasets are often imbalanced.
In the field of fraud detection, data imbalance means there is a big gap in quantity
between classes. For example, in binary classification, a two-class dataset has
far fewer representatives of one class (the minor class) than of the other (the major class).

For a two-class problem, the degree of class imbalance can be represented by
the ratio of the number of minor class samples to that of the major class. In
practical applications, this ratio can be as drastic as 1:100, 1:1000, or even larger
[10]. The class that holds the majority of the examples is called the negative
class, whereas the class with few examples is called the positive class. Classification
algorithms usually achieve high accuracy on the major class. In contrast, the
accuracy on the minor class tends to be low.

In financial fraud detection, the studies in [11-14] reviewed that financial
fraud can be divided into three main areas: internal, insurance and credit.
Internal fraud can further be broken into two sub-categories, namely financial statement
fraud and transaction fraud. Financial statement fraud is also known as management fraud,
while transaction fraud captures the process of snatching organizational assets. In
addition, [9] presented the data imbalance in biomedical data. They showed that some
methods have been developed to rebalance imbalanced data, but the issue has still not been
solved completely; for example, classification performance may be reduced. Some
imbalanced datasets are listed in Table 1.

As can clearly be seen in Fig. 1, data imbalance exists in both the financial domain
and the biomedical domain. The first dataset has the highest rate of imbalance, about
1:20, with 23 fraudulent cases out of 500 transactions in total.
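
The imbalance ratios behind this comparison can be recomputed directly from the counts in Table 1. The short sketch below does so; the dictionary layout and the function name are illustrative choices, while the counts are those listed in the table.

```python
# (minor class count, major class count) as listed in Table 1.
datasets = {
    "UCSD-FICO": (23, 477),
    "Carclaim": (923, 14497),
    "Yeast": (163, 1321),
    "Haberman": (81, 225),
    "German credit data": (300, 700),
    "Australian credit approval": (307, 383),
}


def imbalance_ratio(minor, major):
    """Number of major class samples per minor class sample."""
    return major / minor


ratios = {name: imbalance_ratio(*counts) for name, counts in datasets.items()}
most_imbalanced = max(ratios, key=ratios.get)
```

For UCSD-FICO the ratio is roughly 20.7 major samples per minor sample, which matches the ratio of about 1:20 quoted above.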


2.2 The Approaches for Imbalanced Data Classification 


To deal with imbalanced datasets, researchers have used many techniques such as
supervised learning, unsupervised learning, etc. The techniques have been divided into
two groups: the data level and the algorithmic level [3].


Data-Level Approach

At the data level, techniques are used to adjust the data distribution by rebalancing the
class ratio or by removing the noise between the minor class and the major class before
applying any algorithm. For example, such a method can increase the quantity of minor
class samples, reduce the number of majority class samples, or combine
several techniques. One of the advantages of this approach is flexibility: the
transformed data can be used to train different classifiers [5, 10, 15]. Data-level
methods can be grouped into two categories, over-sampling methods and
under-sampling methods [16].


Table 1. Some imbalanced datasets.


Data | Total records | Cheat | Legal
UCSD-FICO | 500 | 23 | 477
Carclaim | 15,420 | 923 | 14,497
Yeast | 1,484 | 163 | 1,321
Haberman | 306 | 81 | 225
German credit data | 1,000 | 300 | 700
Australian credit approval | 690 | 307 | 383


Over-Sampling Methods 

Over-sampling methods create a dataset that is larger than the initial dataset [8].
Over-sampling consists of enlarging the minority class and decreasing the imbalance
ratio by replicating randomly chosen samples, replicating selected samples, or adding
synthetic samples. This method may increase overfitting and bias because it replicates
minor class samples until the two classes have equal frequency [5, 17, 18].

Both ROS and RUS are simple to understand. ROS randomly selects a set S of samples
from the minority class, replicates the selected samples, and adds the copies
to the minority class. RUS, in contrast, randomly chooses a set S of instances
from the major class and removes them from the original dataset
[19] (Fig. 2).
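
The two random resampling schemes can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the tuple-based samples, and the fixed seed are assumptions made for the example.

```python
import random


def random_over_sample(minority, n_extra, rng):
    """ROS: append n_extra copies of randomly chosen minority samples."""
    return minority + [rng.choice(minority) for _ in range(n_extra)]


def random_under_sample(majority, n_remove, rng):
    """RUS: drop n_remove randomly chosen majority samples."""
    keep = rng.sample(range(len(majority)), len(majority) - n_remove)
    return [majority[i] for i in sorted(keep)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [("fraud", i) for i in range(3)]
majority = [("legal", i) for i in range(10)]

ros_minority = random_over_sample(minority, 7, rng)   # 3 -> 10 samples
rus_majority = random_under_sample(majority, 7, rng)  # 10 -> 3 samples
```

Either call alone brings the two classes to the same size in this toy setting: ROS by duplicating existing minority samples, RUS by discarding majority samples.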

Fig. 1. The imbalance of some datasets

Fig. 2. The basic technique of over-sampling a minority class by replicating its own samples


Synthetic Minority Over-Sampling Technique (SMOTE)

Effective approaches such as SMOTE have had a positive impact on diverse applications.
The main idea of this algorithm is that SMOTE over-samples the minority class
by generating synthetic samples in the feature space, rather than in the data space [20].
These artificial examples are created along the line segments joining a minority sample
to some or all of its k nearest neighbors in the minority class, as follows:
(a) The k nearest neighbors are found using the Euclidean distance. (b) To generate a
new sample, one of the k nearest neighbors is randomly selected and the difference
between the selected sample and this neighbor is computed. (c) This difference is
multiplied by a number generated uniformly from 0 to 1. (d) A new vector is created by
adding the result to the selected sample, and it is labeled as a minority
class sample. SMOTE thus alters the original class distribution in order to increase the
number of minor samples that are correctly classified.

After applying SMOTE, even when no false negatives remain in the dataset, there may
still be a larger number of false positives. In imbalanced classification, most classifiers
are expected to classify all minor samples correctly by reducing
the number of false negatives. This is easy to understand because the cost of missing a
positive sample (a false negative) outweighs that of misclassifying a negative sample (a
false positive). The idea of SMOTE is illustrated in Fig. 3.

Fig. 3. An example of SMOTE technique


The figure shows that the artificial examples are generated along the line segments between
the green triangle examples. These examples can rebalance the original class distribution
and gradually enhance learning. However, the biggest problem of the SMOTE method is that
it still over-generalizes the minority class space.
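
The interpolation at the heart of SMOTE can be sketched as below: a synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors. This is a simplified illustration, not a full implementation; the function names, the toy points, and the seed are assumptions.

```python
import math
import random


def k_nearest(x, points, k):
    """Indices of the k points closest to x (Euclidean), excluding x itself."""
    others = [(math.dist(x, p), i) for i, p in enumerate(points) if p is not x]
    return [i for _, i in sorted(others)[:k]]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority points by linear interpolation."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = minority[rng.choice(k_nearest(x, minority, k))]
        gap = rng.random()  # uniform number in [0, 1)
        # New point = x + gap * (neighbor - x), coordinate by coordinate.
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, 5, 2, random.Random(1))
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already spanned by the minority class, which is also why noisy minority samples can spread into the majority region.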


Under-Sampling Methods 

Under-sampling methods generate a subset of the original dataset by reducing the number
of majority class samples [8]. The simplest way is to remove samples randomly. In the
fraud detection problem, this method randomly selects samples representing
non-fraudulent cases and removes them from the original dataset, as shown in Fig. 4.


[Figure 4: two scatter plots, a) original data and b) data after applying random
under-sampling; legend: Normal, Fraud]


Fig. 4. Demonstration of applying under-sampling


Although this method can reduce the data imbalance, randomly removing samples
could lose important information that is useful for building the model [8].
Algorithmic-Level Approaches

Another family is the algorithm-level approaches. At this level, classifiers are tuned
to improve learning of the features of the minority classes by adjusting
the error costs, so that misclassifying minority samples (fraud cases classified as
non-fraudulent) costs more than misclassifying majority samples
(normal cases classified as fraud).
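
One standard way to express unequal error costs is to shift the decision threshold on the predicted probability of the minority class. The sketch below uses the usual expected-cost argument: predict the minority class when p * cost_fn >= (1 - p) * cost_fp. It is a generic illustration, not the specific adjustment used by any classifier named in this section; the function names and cost values are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting the minority class is cheaper."""
    return cost_fp / (cost_fp + cost_fn)


def predict_minority(p_minority, cost_fp, cost_fn):
    """True if the expected cost favors predicting the minority class."""
    return p_minority >= cost_sensitive_threshold(cost_fp, cost_fn)


# With equal costs the threshold is the usual 0.5.
equal = cost_sensitive_threshold(1.0, 1.0)
# If missing a fraud is nine times as costly as a false alarm,
# a predicted fraud probability of 0.1 is already enough.
skewed = cost_sensitive_threshold(1.0, 9.0)
```

Raising the cost of a false negative lowers the threshold, so more borderline cases are assigned to the minority class.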

For instance, decision tree classifiers such as C5.0 use Information Gain as the splitting
criterion. They can be modified by changing the predicted probability at
leaf nodes or by developing new pruning methods. With the support vector machine
(SVM) method, the adjustment is done by adding different finite constants to different
classes, or by setting class borderlines based on the idea of a kernel link. In the tier-learning
approach, the model is constructed only with samples of the target class. This method
does not attempt to find the boundary that distinguishes the majority and minority classes
but tries to find the boundary that surrounds the target class. Elements
to be classified are measured by their similarity to the elements of the target class,
using a threshold as the boundary between the two classes. If this threshold is too small, a
large number of majority samples will be filtered out; otherwise they will be kept. Therefore,
setting an effective threshold is very important for the tier-learning approach.
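
The effect of unequal error costs on a prediction (for example at a decision tree leaf) can be illustrated with a small sketch. The function `cost_sensitive_label`, the probability estimate, and the cost values are hypothetical and serve only to show the idea of weighting minority errors more heavily.

```python
def cost_sensitive_label(p_fraud, cost_fn, cost_fp):
    """Pick the label with the lower expected misclassification cost.

    p_fraud: estimated probability that the sample is a fraud (minority) case.
    cost_fn: cost of classifying a fraud case as non-fraudulent.
    cost_fp: cost of classifying a usual case as fraud.
    """
    expected_cost_if_normal = p_fraud * cost_fn         # a fraud case would be missed
    expected_cost_if_fraud = (1.0 - p_fraud) * cost_fp  # a usual case would be flagged
    return "fraud" if expected_cost_if_fraud < expected_cost_if_normal else "normal"


# With equal costs a 20% fraud probability is labeled "normal";
# with a ten times higher cost for missed fraud the same sample is labeled "fraud".
equal_costs = cost_sensitive_label(0.2, cost_fn=1.0, cost_fp=1.0)
weighted_costs = cost_sensitive_label(0.2, cost_fn=10.0, cost_fp=1.0)
```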


3 Methodology 


3.1 SPY and ADASYN Methods 


ADASYN 

He et al. proposed ADASYN, a novel adaptive learning algorithm that extends SMOTE
[19]. ADASYN mitigates the class imbalance in a dataset by
generating, for each minority instance, a number of synthetic instances based on the number of its
majority nearest neighbors [20]. The total number of synthetic data samples is determined by a
parameter β, which sets the desired balance level after the synthetic process. In ADASYN, the density
distribution is used as the criterion to decide the number of artificial samples generated for each
minority sample, whereas in SMOTE each minority sample has an equal chance of being selected
for the artificial process. However, ADASYN does not account for noise samples. It may generate a
large number of synthetic data points around such examples, which may create an unrealistic
minority space for the learner.


SPY 

Samples located on or near the borderline tend to be misclassified more often than
those far from it. Based on this observation, Dang et al.
[9] introduced a new approach called SPY, which adjusts the class balance by changing
the labels of majority class samples among the k nearest neighbors of minority samples
into the minority class label. By doing so, the number of majority instances decreases and the
number of minority instances increases. In the SPY method, these relabeled majority class
samples are called "SPY".
The advantage of the SPY method is that it recognizes the vital importance of borderline
samples and strengthens the minority boundary samples to increase classification
performance. Moreover, the SPY method adjusts the data distribution without changing
the data size. On the other hand, in some cases the number of samples selected for relabeling
does not correspond to the distribution needs of the specific dataset. In these cases, SPY does
not improve classification efficiency and may even reduce accuracy.
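
The relabeling idea behind SPY can be sketched as follows. This is only an illustration of the description above, not the implementation of [9]: the function name `spy_relabel`, the Euclidean distance, the 0/1 label encoding, and the toy data are assumptions.

```python
import math


def spy_relabel(X, y, k, minority=1, majority=0):
    """Relabel majority samples found among the k nearest neighbours of a minority sample."""
    new_y = list(y)
    for i, label in enumerate(y):
        if label != minority:
            continue
        # Distances from minority sample i to every other sample, closest first.
        neighbours = sorted(
            (math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i
        )[:k]
        for _, j in neighbours:
            if y[j] == majority:
                new_y[j] = minority  # this majority sample becomes a "SPY"
    return new_y


# Toy one-dimensional data (hypothetical): one minority sample followed by five majority samples.
X = [[0.0], [0.1], [0.2], [5.0], [6.0], [7.0]]
y = [1, 0, 0, 0, 0, 0]
y_spy = spy_relabel(X, y, k=2)
```

The number of samples stays the same while the two majority samples closest to the minority sample change class, which matches the remark that SPY adjusts the distribution without changing the data size.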


3.2 MASI Algorithm


Idea 

The proposed algorithm is an improvement of ADASYN [19] and SPY [9]. Based on the data
density distribution, MASI relabels majority class samples as minority class samples.
The number of changed labels is affected by two factors. First, the ratio between the
numbers of majority and minority class samples influences the total number of samples relabeled.
Second, for each minority sample, a different number of nearest neighbors is chosen for
relabeling. This number depends on the proportion of majority class samples among its
nearest neighbors. In other words, the number of majority nearest
neighbors whose labels are changed is proportional to this rate (Fig. 5).


a) Prototype dataset    b) The dataset after using MASI


Fig. 5. Data representation before and after using MASI

Algorithm MASI (T, β, k).


Input: Training dataset T based on the original set
  $S = \{(x_i, y_i)\}$, where $y_i \in L$ is the sample's label, $x_i \in R^n$ and $i \in (1, n)$;
  k is the number of nearest neighbors, and β is a parameter to
  estimate the desired equilibrium ratio.
Output: New training set T'
1. Randomly select 90% of the instances for training and keep the rest as the
   validation set: $T = 90\% \times S$, $V = S - T$,
   $n_{maj} = majority(T)$, $n_{min} = minority(T)$
2. Calculate the number of majority samples whose labels need to be
   changed to the minority class:
   $G = (n_{maj} - n_{min}) \times \beta$    (1)
3. For i := 1 to $n_{min}$ do
   For each minority data sample $p_i$:
   o Find its k nearest neighbors in the entire training dataset
   o Compute the proportion of nearest neighbors belonging to the
     majority class ($\Delta_i$) among the k nearest neighbors:
     $r_i = \Delta_i / k$, where $r_i \in [0, 1]$    (2)
4. Calculate the density distribution of the data, called $\hat{r}_i$:
   $\hat{r}_i = r_i / \sum_{i=1}^{n_{min}} r_i$    (3)
   where $\sum_{i} \hat{r}_i = 1$
5. For i := 1 to $n_{min}$ do
   For each minority data sample $p_i$:
   o Compute the number $g_i$ of nearest neighbors to be relabeled
     around each minority sample by the following formula:
     $g_i = \hat{r}_i \times G$    (4)
   where G is taken from Eq. (1).
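
Steps 2 to 5 of the algorithm can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `masi_relabel_counts`, the Euclidean distance, the 0/1 label encoding, and the toy data are hypothetical, and the values g_i are left unrounded because the text does not say how they are rounded.

```python
import math


def masi_relabel_counts(X, y, k, beta, minority=1, majority=0):
    """Return {minority index: g_i}, the number of neighbours to relabel around each minority sample."""
    n_min = sum(1 for label in y if label == minority)
    n_maj = sum(1 for label in y if label == majority)
    G = (n_maj - n_min) * beta  # Eq. (1): total number of labels to change

    ratios = {}
    for i, label in enumerate(y):
        if label != minority:
            continue
        neighbours = sorted(
            (math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i
        )[:k]
        n_majority_neighbours = sum(1 for _, j in neighbours if y[j] == majority)
        ratios[i] = n_majority_neighbours / k  # Eq. (2)

    total = sum(ratios.values())
    if total == 0:
        return {i: 0.0 for i in ratios}  # no minority sample has a majority neighbour
    # Eq. (3) normalises the ratios, Eq. (4) distributes G proportionally.
    return {i: (r / total) * G for i, r in ratios.items()}


# Toy one-dimensional data (hypothetical): 3 minority and 5 majority samples.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [10.5], [12.0], [13.0]]
y = [1, 0, 0, 0, 1, 1, 0, 0]
g = masi_relabel_counts(X, y, k=2, beta=1.0)
```

In this toy case the isolated minority sample at index 0, surrounded only by majority samples, receives half of the relabeling budget, while the two clustered minority samples share the rest.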


Evaluation Measures 

A confusion matrix (contingency table) is a typical tool for representing
classification performance [21]. As an illustration, the minor class and the major class are called
positive and negative, respectively. Table 2 shows this matrix for the two-class case.


Table 2. A confusion matrix


                 | Forecasted as positive | Forecasted as negative
Genuine positive | TP                     | FN
Genuine negative | FP                     | TN

In this matrix, the rows and the columns are the genuine class labels and the forecasted
class labels, respectively. TN (True Negative) is the number of major class samples that are
correctly classified; FN (False Negative) is the number of minor class samples
mistakenly classified as major class samples; TP (True Positive) is the number of minor
class samples that are correctly classified; FP (False Positive) is the number of major class
samples that are mistakenly classified as minor class samples.

For balanced datasets, the efficiency of a classification model can be evaluated by
accuracy (or error rate), i.e. the ratio of correctly classified samples to all samples,
$(TP + TN) / (TP + TN + FP + FN)$.

According to [9, 17, 22], two values, SE (sensitivity) and SP
(specificity), are used to measure classification performance on imbalanced data. SE is the
ratio of correctly forecasted positive (minority class) samples, $SE = TP / (TP + FN)$, while SP is
the ratio of correctly forecasted negative samples, $SP = TN / (TN + FP)$. In our paper,
G-mean is used as a measurement to evaluate the balance between SE and SP. It
is defined as:


$G\text{-mean} = \sqrt{SE \times SP}$    (5)


In imbalanced data classification, k-fold cross-validation (k = 10) is used to validate
the classifiers' performance. Moreover, each reported G-mean is the average of 20 G-mean
values, each taken from one run of k-fold cross-validation.
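
The measures above can be computed directly from the four confusion matrix counts. The sketch below is illustrative: the function name `se_sp_gmean` and the example counts are assumptions, while the last line recomputes a G-mean from the SE and SP percentages reported for SVM on the original German credit data in Table 4.

```python
import math


def se_sp_gmean(tp, fn, fp, tn):
    """Sensitivity, specificity and their geometric mean, Eq. (5)."""
    se = tp / (tp + fn)  # share of minority (positive) samples found
    sp = tn / (tn + fp)  # share of majority (negative) samples found
    return se, sp, math.sqrt(se * sp)


# Hypothetical confusion matrix: 40 positive and 160 negative samples.
se, sp, g_mean = se_sp_gmean(tp=30, fn=10, fp=20, tn=140)

# G-mean recomputed from SE = 39.75 and SP = 91.53 (percentages).
g_from_table = math.sqrt(39.75 * 91.53)
```

The recomputed value is close to the reported 60.3. Since the reported G-means are averages over cross-validation runs, they need not coincide exactly with the G-mean of the averaged SE and SP.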


4 Experimental Results 


4.1 Datasets 


We tested four imbalanced datasets from the UCI Machine Learning Repository, divided into
two types: financial datasets and biomedical datasets.
Table 3 shows the imbalance ratio of each dataset.


Table 3. The detailed imbalanced datasets from UCI


Dataset            | Samples | Attributes | Ratio minor/major
German credit card | 1,000   | 20         | 1:2.33
UCSD-FICO          | 500     | 19         | 1:20.74
Haberman           | 306     | 3          | 1:2.28
Yeast              | 1,484   | 8          | 1:28.1


4.2 Results 


We experimented with MASI on four imbalanced datasets and compared it with five
alternative adjustment algorithms: ROS, RUS, SMOTE, Borderline-SMOTE1
(BSO1), and SPY. The classification models trained on the adjusted data are
SVM, C5.0, and RF; these algorithms were determined to be suitable resampling methods
for the selected classifiers. In addition, SE, SP, and G-mean are used as
criteria to assess classification performance across these approaches. This study also
used the k-fold cross-validation method (k = 10) in testing to reduce the limitation of small
datasets. Table 4 shows the experimental results for all classifiers.


Table 4. Classifiers' performance


Data               | Method   | SVM SE | SVM SP | SVM G-mean | C5.0 SE | C5.0 SP | C5.0 G-mean | RF SE | RF SP | RF G-mean
German credit data | Original | 39.75  | 91.53  | 60.3       | 47.62   | 83.7    | 63.1        | 41.9  | 91.76 | 62
                   | ROS      | 67.72  | 74.56  | 71.05      | 52.62   | 76.87   | 63.58       | 50.82 | 87.38 | 66.63
                   | RUS      | 73.48  | 68.92  | 71.16      | 65.57   | 65.74   | 65.63       | 73.37 | 69.38 | 71.34
                   | SMOTE    | 69.83  | 71.49  | 70.65      | 56.05   | 73.14   | 64          | 56.65 | 82.24 | 68.25
                   | BSO1     | 65.55  | 75.64  | 70.41      | 50.32   | 80.65   | 63.68       | 48.38 | 87.8  | 65.17
                   | SPY      | 70.23  | 72.26  | 71.22      | 63.13   | 69.53   | 66.24       | 71.02 | 70.61 | 70.81
                   | MASI     | 70.08  | 73.67  | 71.85      | 60.35   | 72.81   | 66.27       | 69.82 | 73.11 | 71.44
UCSD FICO          | Original | 0      | 100    | 0          | 0.65    | 99.61   | 3.12        | 23.7  | 99.62 | 48.37
                   | ROS      | 56.3   | 91.57  | 71.73      | 46.74   | 96.21   | 66.9        | 35.43 | 99.34 | 59.27
                   | RUS      | 71.3   | 62.53  | 66.65      | 68.04   | 59.95   | 63.72       | 80    | 68.5  | 73.96
                   | SMOTE    | 49.78  | 95.07  | 68.73      | 31.3    | 95.34   | 54.45       | 32.17 | 98.85 | 56.29
                   | BSO1     | 48.26  | 95.21  | 67.6       | 37.39   | 94.62   | 59.02       | 32.83 | 98.95 | 56.86
                   | SPY      | 73.26  | 73.44  | 73.31      | 73.04   | 75.31   | 74.07       | 68.7  | 79.77 | 73.97
                   | MASI     | 73.26  | 76.35  | 74.76      | 73.26   | 79.01   | 76.04       | 79.78 | 70.46 | 74.95
Haberman           | Original | 18.70  | 92.93  | 41.62      | 17.28   | 91.36   | 39.14       | 23.46 | 89.91 | 45.85
                   | ROS      | 48.70  | 72.04  | 59.19      | 55.86   | 67.18   | 61.14       | 38.52 | 78.49 | 54.95
                   | RUS      | 54.51  | 70.27  | 61.84      | 55.12   | 72.04   | 62.97       | 58.27 | 68.69 | 63.22
                   | SMOTE    | 67.22  | 57.24  | 62.01      | 54.63   | 77.42   | 65.00       | 44.07 | 77.00 | 58.22
                   | BSO1     | 63.40  | 59.89  | 61.61      | 61.67   | 68.78   | 65.09       | 38.27 | 81.93 | 55.93
                   | SPY      | 55.86  | 76.67  | 65.40      | 60.09   | 66.00   | 63.30       | 58.89 | 74.13 | 66.02
                   | MASI     | 60.99  | 70.07  | 65.34      | 57.28   | 74.42   | 65.26       | 60.99 | 70.07 | 65.34
Yeast              | Original | 3.73   | 99.98  | 19.55      | 26.76   | 99.14   | 51.33       | 14.12 | 99.73 | 37.39
                   | ROS      | 62.45  | 90.69  | 75.22      | 41.76   | 96.65   | 63.45       | 31.08 | 98.93 | 55.41
                   | RUS      | 62.06  | 94.69  | 76.63      | 65.98   | 93.42   | 78.48       | 51.47 | 96.11 | 70.29
                   | SMOTE    | 58.73  | 93.6   | 74.1       | 62.06   | 92.51   | 75.73       | 54.8  | 95.99 | 72.51
                   | BSO1     | 42.45  | 97.6   | 64.34      | 30.39   | 99.04   | 54.78       | 24.41 | 99.06 | 49.1
                   | SPY      | 70.59  | 92.92  | 80.98      | 70.49   | 92.19   | 80.6        | 68.73 | 92.11 | 79.54
                   | MASI     | 85.39  | 81.7   | 83.52      | 90.39   | 73.76   | 81.63       | 84.02 | 83.05 | 83.52
It can be clearly seen that, on almost all datasets, the MASI method achieves a significant
gain in G-mean compared to the other methods. In particular, the proposed method
performs outstandingly on the Yeast dataset, where its G-mean is much
better than that of the well-known SPY method (83.52 versus 80.98 with SVM, 81.63 versus
80.60 with C5.0, and 83.52 versus 79.54 with RF). On the UCSD dataset, the tested classifiers failed to
identify fraud in the original dataset, while their results improved after resampling. Moreover,
MASI still presented the highest G-mean with all classifiers. This can be explained by the fact that
both datasets (UCSD and Yeast) have high imbalance ratios of 1:20.74 and 1:28.1,
respectively (see Table 3). Among the three classifiers, C5.0 seems to have the best results
compared to SVM and RF.

Overall, the MASI method achieves the highest G-mean on almost all imbalanced datasets
with all classifiers. In particular, its G-mean is the highest on three datasets (German, UCSD,
and Yeast) with all three classification algorithms, and it is close to the best result on
Haberman.


5 Conclusion 


The data imbalance issue is becoming common in many practical applications and has
attracted great interest from researchers. Numerous methods try to solve this
problem to enhance classification performance. In addition, minority boundary samples
are misclassified more often than other samples. Therefore, we propose the MASI method,
which increases the number of minority borderline samples, based on the data distribution,
by changing the labels of neighboring majority samples. We also implemented MASI and
compared it experimentally with other methods. Our results showed that the proposed method
outperforms other methods such as ROS, RUS, SMOTE, Borderline SMOTE, and SPY when using
the classifiers SVM, C5.0, and RF. Although our method achieves significant results, it still has
some shortcomings, such as algorithm complexity and imbalanced regression. An interesting idea
for addressing these challenges in the future is to combine MASI with previous methods such as
SPY or SMOTE.
Abstract. In this paper we present a hybridization of Rough Set (RS)
theory and Support Vector Machine (SVM). Both approaches to data
analysis employ the area between positively and negatively labeled examples,
i.e. the "boundary region" in RS and the "margin" in SVM, but
they offer different ways to use this concept in the classification problem.
We will show that, despite the differences, many Rough Set methods can
also be implemented by SVM. In particular, we will show that the discretization
problem from the rough set methodology can also be solved by SVM with a
special Boolean kernel. At the end we propose a compound classification
method that aggregates the feature selection method in RS and the object
selection method in SVM.


Keywords: Classification problems - Rough sets - Support vector 
machine - Boolean kernels - Hybrid systems 


1 Introduction 


In machine learning, classification is the problem of constructing an algorithm
that can categorize new, unseen data objects into some predefined classes.
Such algorithms are called classification algorithms or, in short, classifiers.
It is a typical supervised learning task, because the classifiers are constructed
(learned) from labeled training data sets. There are at least three ways to define
the partition of the instance space: (1) using a logical expression, (2) using
the geometry of the instance space and (3) using probability. Thus the existing
classification methods can be categorized into Logical, Geometrical and Probabilistic
approaches to Machine Learning [6].

In this paper we present a comparative analysis of two techniques
for classifier construction, offered by rough set theory (also called Rough
Sets or briefly RS) and Support Vector Machine (SVM). Rough classifiers
follow a rather Logical approach, while SVM is a typical Geometrical approach
to Machine Learning. The choice of these classification techniques was dictated
by the fact that both Rough Sets and Support Vector Machine pay close attention
to the border area between decision classes. However, this concept (called
"boundary region" in rough set theory and "margin" in SVM) is used in different
ways in order to form the final classifiers.

The rough set methodology for classification looks for minimal subsets of
features that maintain as much discriminant power as possible compared with the set of
all features. Usually this idea leads to methods that minimize the
boundary region between the decision classes. Many existing methods in Rough
Sets follow the Minimal Description Length (MDL) principle and the hierarchical
concept approximation idea. The Boolean reasoning methodology is also a popular
technique in rough set theory [14]. The problem of learning a rough classifier from
training data can be decomposed into simpler tasks such as searching for minimal reducts,
for minimal decision reducts or for a set of irreducible decision rules.

Rough Sets and Support Vector Machine are realized by different computation
techniques. Support Vector Machine (SVM) follows the maximal margin
principle. The SVM approach to classification is founded on two mathematical tricks.
The first trick applies the Lagrange multiplier strategy to find linear classifiers
with the maximal margin. This method transforms the original problem into a
corresponding quadratic optimization problem. The second trick, also called the kernel
trick, implements an embedding of the input space in a very large feature space
without increasing the time and memory complexity. Moreover, searching for
reducts in Rough Sets can be interpreted as a feature selection method, while
searching for support vectors in SVM is a method for object reduction.

There are a few ideas for combining RS with SVM. In [3,5,9] the authors proposed
to use the attribute reduction method from RS as a pre-processing step for
SVM. Other researchers proposed to apply two SVM models to represent the
lower and upper approximations of a concept [1, 10-12, 19]. In [15], we presented
another approach to combining SVM and RS. We recalled the concept of Boolean
kernels, which was designed for learning monomial Boolean formulas, and showed
that fundamental rough set techniques such as searching for minimal reducts,
extraction of decision rules or construction of rule-based classification algorithms
can also be implemented by SVM with different Boolean kernels. However, the
proposed method was designed for decision tables with symbolic values only.
This paper extends the previous results by showing that the methods proposed
in [15] can be modified so that they become applicable to data sets with real-value
attributes. We present a special kernel function that realizes the discretization
method based on Boolean reasoning and rough set theory. We also propose
a hybrid method that combines the feature selection technique in RS and the object
selection technique in SVM.

The paper is organized as follows. Section 2 contains some basic notions of
RS, SVM and Boolean kernels and reviews the main results of [15]. In Sects. 3 and
4 we present the SVM with Boolean kernels that implements the discretization
problem in rough set theory, as well as a hybrid method that combines SVM
with a Boolean kernel and the Rough Set method for classification. The paper ends
with conclusions and future plans in Sect. 5.
2 Preliminaries


Rough Set theory was proposed by Z. Pawlak in [16] as a data-driven method
for the concept approximation problem under vague, imprecise, inconsistent and
incomplete information and knowledge. In Rough Set Theory, uncertainty and
imprecision are expressed by means of the boundary region of a set. Formally, in the
simplest case, when we have a finite set of objects and each object is described
by a finite set of attributes, the information system is a tuple $S = (U, A)$, where
U is a set of objects and A is a set of attributes, i.e. a set of functions of the form
$a : U \to V_a$, where $V_a$ is the domain of attribute $a \in A$. For any subset of
attributes $B \subseteq A$ we define the B-indiscernibility relation, denoted by
$IND_S(B)$, as follows:


$IND_S(B) = \{(x_i, x_j) \in U \times U : a(x_i) = a(x_j) \text{ for all } a \in B\}$.


A pair of different objects satisfies the indiscernibility relation if we are not able 
to discern those objects using only available information about them. 
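
The indiscernibility relation partitions the objects into classes of mutually indiscernible objects. A minimal sketch, assuming the information system is stored as a dictionary of rows; the function name `indiscernibility_classes` and the toy table are hypothetical:

```python
from collections import defaultdict


def indiscernibility_classes(table, attrs):
    """Group object ids whose values agree on every attribute in attrs."""
    groups = defaultdict(list)
    for obj_id, row in table.items():
        groups[tuple(row[a] for a in attrs)].append(obj_id)
    return sorted(groups.values())


# Toy information system (hypothetical) with two attributes.
toy = {
    "u1": {"outlook": "sunny", "windy": False},
    "u2": {"outlook": "sunny", "windy": True},
    "u3": {"outlook": "rainy", "windy": False},
    "u4": {"outlook": "sunny", "windy": False},
}
classes_outlook = indiscernibility_classes(toy, ["outlook"])
classes_all = indiscernibility_classes(toy, ["outlook", "windy"])
```

Using more attributes can only split classes further: u1 and u4 remain indiscernible even when both attributes are available.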

The training data sets for classification problems are a special case of
information systems called decision tables, because they contain a special
attribute that defines the partition of objects into decision classes. Formally, a
decision table is a tuple $S = (U, A \cup \{dec\})$, where $dec : U \to V_{dec}$ is called the
decision attribute. In the case of a binary classification problem, $V_{dec} = \{-1, 1\}$.

Feature selection refers to methods that look for subsets of features
(attributes) that are most relevant for the classification task, or that eliminate
the features with little or no predictive information. Feature selection can significantly
improve the quality of classification models and often yields classifiers
that generalize better to new unseen objects. The feature selection problem
in Rough Set Theory is defined in terms of reducts [17].

For decision tables, we are interested in the classification power of the subsets
of condition attributes. A set of attributes $B \subseteq A$ is called a decision-oriented
reduct (or a relative reduct) of the decision table $S = (U, A \cup \{dec\})$ if and only if B
is a minimal subset (with respect to inclusion) of A satisfying the property
that for any pair $(x_i, x_j)$ of objects, if $dec(x_i) \neq dec(x_j)$ and $(x_i, x_j) \notin IND_S(A)$
then $(x_i, x_j) \notin IND_S(B)$.
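
For small tables, the definition can be checked by brute force. The sketch below enumerates attribute subsets by increasing size and keeps the minimal ones that preserve discernibility. It is an illustration of the definition only (exponential in the number of attributes), not a method from the paper, and the function names and toy table are hypothetical.

```python
from itertools import combinations


def preserves_discernibility(rows, decisions, attrs, all_attrs):
    """True if attrs discerns every pair with different decisions that all_attrs discerns."""
    for (i, ri), (j, rj) in combinations(enumerate(rows), 2):
        if decisions[i] == decisions[j]:
            continue
        discerned_by_all = any(ri[a] != rj[a] for a in all_attrs)
        discerned_by_attrs = any(ri[a] != rj[a] for a in attrs)
        if discerned_by_all and not discerned_by_attrs:
            return False
    return True


def decision_reducts(rows, decisions, all_attrs):
    """Enumerate all minimal attribute subsets that preserve discernibility."""
    found = []
    for size in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            # A superset of an already found reduct cannot be minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if preserves_discernibility(rows, decisions, subset, all_attrs):
                found.append(subset)
    return found


# Toy decision table (hypothetical): the decision equals attribute c, which is a XOR b.
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 0},
]
decisions = [0, 1, 1, 0]
reducts = decision_reducts(rows, decisions, ["a", "b", "c"])
```

The toy table has two reducts: the single attribute c and the pair (a, b).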

The rough set methodology for classification problems makes use of rule-based
classifiers, see [2]. Therefore, searching for a set of high-quality rules is one
of the fundamental problems in Rough Sets. In fact, decision rules are implications
in description logic that present the relationship between conditional and
decision attributes. Let us consider the simplest form of decision rules:


$r \equiv (a_{i_1} = v_1) \wedge \dots \wedge (a_{i_m} = v_m) \Rightarrow (dec = d)$    (1)


Thus the premises of decision rules are conjunctions of descriptors of the form
$a = v$ for $a \in A$ and $v \in V_a$.

Each decision rule r is characterized by its length (denoted by length(r)),
which is the number of descriptors in the conjunction, by strength(r), the
number of objects satisfying the premise of r, and by confidence(r), the ratio
of the number of objects satisfying r to the strength of r. We say that r is
consistent with the decision table S if and only if confidence(r) = 1.
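
The three rule characteristics can be computed directly from a decision table. A minimal sketch, assuming rows are stored as dictionaries and a premise as a dictionary of descriptors; the function name `rule_stats` and the toy table are hypothetical:

```python
def rule_stats(rows, decisions, premise, decision):
    """Return (length, strength, confidence) of the rule premise => decision."""
    # Objects satisfying every descriptor a = v of the premise.
    matching = [
        i for i, row in enumerate(rows)
        if all(row[a] == v for a, v in premise.items())
    ]
    strength = len(matching)
    # Objects satisfying the whole rule: premise and decision.
    correct = sum(1 for i in matching if decisions[i] == decision)
    confidence = correct / strength if strength else 0.0
    return len(premise), strength, confidence


# Toy decision table (hypothetical).
rows = [
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "sunny", "humidity": "normal"},
    {"outlook": "rainy", "humidity": "high"},
]
decisions = ["no", "no", "yes", "yes"]
consistent_rule = rule_stats(rows, decisions, {"outlook": "sunny", "humidity": "high"}, "no")
shorter_rule = rule_stats(rows, decisions, {"humidity": "high"}, "no")
```

Dropping a descriptor makes the rule shorter and stronger (3 objects instead of 2) but it is no longer consistent, since its confidence falls to 2/3.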


2.1 Support Vector Machine 


Any decision table $S = (U, A \cup \{dec\})$ with real-value attributes and
$U = \{u_1, \dots, u_n\}$ can be interpreted as a training set containing object-decision pairs


$D = \{(x_1, d_1), \dots, (x_n, d_n)\}$


where $x_i \in R^m$ is the information vector of object $u_i \in U$ and
$d_i = dec(u_i) \in \{-1, 1\}$ for $1 \le i \le n$. The support vector machine (SVM) problem is defined
in [18] as the problem of searching for a linear classifier $L : R^m \to \{-1, 1\}$ of
the form $L(x) = w \cdot \Phi(x) + b$, where $\Phi$ is a function that embeds the objects
from $R^m$ into a higher (even infinite) dimensional space, such that L properly
classifies the objects from D and the margin of L is maximal. These conditions
can be described by the following optimization problem:


$\min_{w, b, \xi} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i$


subject to $d_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \dots, n$    (2)


If $\Phi$ is a special embedding function for which there exists a function
$K : R^m \times R^m \to R$ (called the kernel function) such that
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
for any $x_i, x_j \in R^m$, then, applying the Lagrange multiplier strategy, this problem
can be transformed into the following quadratic optimization problem:


$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j d_i d_j K(x_i, x_j)$


subject to $C \ge \alpha_i \ge 0$, $i = 1, \dots, n$ and $\sum_{i=1}^{n} \alpha_i d_i = 0$    (3)


This latter formulation, called the dual representation of the SVM problem, changes the
search space, because it looks for the optimal (non-negative) significance
weight $\alpha_i$ for each object $u_i \in U$. Moreover, if the vector $(\alpha_1^o, \dots, \alpha_n^o)$ is the optimal
solution of problem (3), then the objects $u_i$ corresponding to $\alpha_i^o = 0$ are said
to be redundant and should be removed from the training decision table. The
remaining objects (corresponding to $\alpha_i^o > 0$) are called the support vectors and
they can be used to define the SVM classifier as follows:


$d_{SVM}(x) = sgn\left( \sum_{\text{sup. vectors}} d_i \alpha_i^o K(x_i, x) + b^o \right)$    (4)


where sgn is the sign function and x € R™ is the information vector of an 
arbitrary new, unseen object. 
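
Once the support vectors and their weights are known, the classifier in Eq. (4) is a weighted sum of kernel values. The sketch below only evaluates this decision function and does not solve the optimization problem (3). The function names, the linear kernel, the support vectors, the weights and the bias are hypothetical values chosen for illustration, and the sign of zero is mapped to 1 by assumption.

```python
def svm_decision(x, support_vectors, kernel, bias):
    """Evaluate Eq. (4): sign of the weighted kernel sum over the support vectors."""
    total = sum(d * alpha * kernel(x_i, x) for x_i, d, alpha in support_vectors)
    return 1 if total + bias >= 0 else -1


def linear_kernel(u, v):
    """Plain inner product, i.e. the kernel of the identity embedding."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical support vectors as (x_i, d_i, alpha_i); note that sum(alpha_i * d_i) = 0.
support_vectors = [((1.0, 1.0), 1, 0.25), ((-1.0, -1.0), -1, 0.25)]
label_a = svm_decision((2.0, 0.5), support_vectors, linear_kernel, bias=0.0)
label_b = svm_decision((-3.0, 1.0), support_vectors, linear_kernel, bias=0.0)
```

Only the support vectors enter the sum, which is why objects with a zero weight can be dropped from the training table without changing the classifier.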
2.2 SVM with Boolean Kernels in Learning Rule-Based Classifiers for Decision Tables with Symbolic Values


In rough set theory, the classification problem can be solved by the same techniques
used for the concept approximation problem. The idea is to construct some
"rough classifiers" from the training data. Rough classifiers should contain the
description of the lower and upper approximations of all decision classes. Instead
of a hard prediction for a new unseen object, a rough classifier can return a vector of
positive real values that describe the degrees to which we believe the considered
object belongs to the decision classes.

For a given decision table $S = (U, A \cup \{dec\})$, the typical process of building a
rough classifier starts with the rule induction step, in which a set RULE(S) of short, strong
(with high support) and high-confidence decision rules is generated
from the decision table S. Let $RULE(S) = \{R_1, \dots, R_p\}$, and let $w_i$ be the strength
of rule $R_i$ and $dec(R_i)$ the decision class of the decision rule $R_i$ for $i = 1, \dots, p$.

In order to classify a new unseen object $x \notin U$ we should implement the
function $Match(x, R_i)$, which returns the degree to which the object x matches
the decision rule $R_i$. Let $MRULE(S, x) = \{R_i : Match(x, R_i) > \delta\}$ (for some
$\delta > 0$) denote the set of rules from RULE(S) that are satisfied by x.

The final decision returned by the rough classifier based on the decision rule
set RULE(S) can be calculated by a voting step, which can be expressed by
the following formula:


$Dec_{RULE}(x) = \mathcal{S}\left( \sum_{R_i \in MRULE(S, x)} dec(R_i) \cdot w_i \cdot Match(x, R_i) \right)$    (5)

where $\mathcal{S}$ is a function that, similarly to the activation function in artificial neural
networks, converts the total voting power into one of the decision classes [14].

We can see that Eqs. (4) and (5) are quite similar. This observation was
the main motivation of our previous paper [15]. We proved that for a symbolic-value
decision table $S = (U, A \cup \{d\})$ with two decision classes, the SVM with
the following kernel function can implement a rule-based classifier for S:


$K_{\varepsilon}(u_i, u_j) = (1 + \varepsilon)^{|N_{i,j}|} - 1$    (6)


where $u_i, u_j$ are arbitrary objects of U, and $|N_{i,j}| = |\{a \in A : a(u_i) = a(u_j)\}|$
is the number of attributes for which $u_i$ and $u_j$ are indiscernible.

Moreover, if $\varepsilon$ is a very small positive value, for example $\varepsilon < 0.05$, one can
approximate the value $(1 + \varepsilon)^{1/\varepsilon}$ by e, therefore


$K_{\varepsilon}(u_i, u_j) \approx e^{\varepsilon |N_{i,j}|} - 1$

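
The Boolean kernel of Eq. (6) only needs the number of attributes on which two objects agree. A small sketch, assuming objects are tuples of symbolic values and that the exponential approximation keeps the "- 1" term of Eq. (6); the function names and the example objects are hypothetical:

```python
import math


def boolean_kernel(u, v, eps):
    """Eq. (6): (1 + eps)^|N_ij| - 1, where |N_ij| counts the attributes with equal values."""
    n_equal = sum(1 for a, b in zip(u, v) if a == b)
    return (1.0 + eps) ** n_equal - 1.0


def boolean_kernel_approx(u, v, eps):
    """Approximation for small eps, using (1 + eps)^(1/eps) ~ e."""
    n_equal = sum(1 for a, b in zip(u, v) if a == b)
    return math.exp(eps * n_equal) - 1.0


# Two objects described by (outlook, temperature, humidity, windy); they agree on 2 attributes.
u = ("sunny", "hot", "high", False)
v = ("sunny", "mild", "high", True)
exact = boolean_kernel(u, v, eps=0.01)
approx = boolean_kernel_approx(u, v, eps=0.01)
```

For eps = 0.01 the two values differ only in the fourth decimal place; the gap grows with eps and with the number of matching attributes.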
3 Discretization Problem in Rough Set Theory 


Many classification methods, including rule-based classification in Rough Sets,
are not applicable to data sets with numeric data. Moreover, experimental
results show that the accuracy of many classification methods can be
significantly improved if the data set is discretized beforehand. In general,
discretization is a partition of the real axis into disjoint intervals. Such a partition
can be defined by a relevant set of cuts (i.e., boundary points of intervals).

Let $S = (U, A \cup \{dec\})$ be a decision table where all attributes are numeric,
i.e. $V_a \subset R$. For any real value $c \in V_a$, the pair $(a, c)$ is called a cut on attribute
a, because it defines a partition of the real axis into two intervals. We say that
two objects $u_i, u_j \in U$ are discernible by a cut $(a, c)$ if they lie on different
sides of the cut $(a, c)$. The discernibility can be easily verified by checking whether
$(a(u_i) - c)(a(u_j) - c) < 0$.

A set of cuts C is called S-consistent if any pair of objects $u_i, u_j \in U$ such
that $dec(u_i) \neq dec(u_j)$ and $(u_i, u_j) \notin IND_S(A)$ is discernible by at
least one cut from C. By the optimal discretization problem we denote the problem of
looking for an S-consistent set of cuts that is minimal with respect to inclusion.
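
Both the discernibility test and S-consistency are easy to check programmatically. The following sketch assumes numeric rows stored as dictionaries and cuts stored as (attribute, value) pairs; the function names, the toy table and the candidate cuts are hypothetical.

```python
from itertools import combinations


def discerns(cut, u, v):
    """True if u and v lie on different sides of the cut (a, c)."""
    a, c = cut
    return (u[a] - c) * (v[a] - c) < 0


def is_consistent(cuts, rows, decisions):
    """True if every discernible pair with different decisions is separated by some cut."""
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] == decisions[j]:
            continue
        if rows[i] == rows[j]:
            continue  # the pair is indiscernible even with all attributes
        if not any(discerns(cut, rows[i], rows[j]) for cut in cuts):
            return False
    return True


# Toy numeric decision table (hypothetical).
rows = [
    {"a": 0.8, "b": 2.0},
    {"a": 1.4, "b": 1.0},
    {"a": 1.4, "b": 3.0},
    {"a": 2.0, "b": 2.5},
]
decisions = [1, -1, 1, -1]
all_cuts = [("a", 1.1), ("b", 1.5), ("a", 1.7)]
full_ok = is_consistent(all_cuts, rows, decisions)
reduced_ok = is_consistent(all_cuts[:2], rows, decisions)
```

Removing the cut ("a", 1.7) breaks consistency because the last two objects are then no longer separated, so this cut cannot be dropped from the set.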

One of the algorithms for optimal discretization was proposed in [13, 14].
The idea is based on the Boolean reasoning methodology. In this method, the
optimal discretization problem for a given decision table $S = (U, A \cup \{dec\})$ is
transformed into a corresponding Boolean function $\phi_S$. This transformation has
the following property: a set of cuts is a minimal S-consistent set $\Leftrightarrow$ the corresponding set
of Boolean variables is a prime implicant of $\phi_S$.

In fact, the Boolean function $\phi_S$ that encodes the optimal discretization
problem is defined by


$\phi_S = \prod_{dec(u_i) \neq dec(u_j)} \; \sum_{(a,c):\ u_i, u_j \text{ are discernible by } (a,c)} p_{(a,c)}$    (7)


where $p_{(a,c)}$ is a Boolean variable corresponding to a cut $(a, c)$ from a given set of
candidate cuts C.

The encoding presented above indicates that all heuristics for the prime implicant
problem can also be applied to solve the optimal discretization problem. In
particular, the above-mentioned SVM with Boolean kernels can also be modified
to learn the optimal discretization.

The main idea of our proposition is to modify the Boolean kernel $K_{\varepsilon}$
described in the previous section. Let us define the new Boolean kernel as follows:


$K^{disc}_{\varepsilon}(u_i, u_j) = (1 + \varepsilon)^{2(|C| - m_{i,j})} - 1$    (8)


where $\varepsilon \in R^+$ is a real positive parameter, C is the set of candidate cuts, and $m_{i,j}$
is the number of cuts from C that discern the objects $u_i$ and $u_j$ of the given
decision table. In the case of a decision table with real-value attributes, instead of the
standard decision rules in Eq. (1), we should use interval decision rules, i.e.
decision rules of the form:


$r \equiv (a_{i_1} \in (l_1, r_1)) \wedge \dots \wedge (a_{i_m} \in (l_m, r_m)) \Rightarrow (dec = d)$    (9)


We can see that interval decision rules can be used to build rough classifiers instead
of the standard decision rules.
We have the following:


Table 1. The "weather" decision table (left) and its Boolean representation (right)


Theorem 1. For a given real-value decision table $S = (U, A \cup \{dec\})$ with two
decision classes, the SVM using the Boolean kernel shown in Eq. (8) can simulate
a classifier using the interval decision rules for S.


One can prove this theorem by transforming the original decision table
$S = (U, A \cup \{dec\})$ with real-value attributes into a new binary decision table
$S_{bin} = (U, A_C \cup \{dec\})$ containing the same set of objects, in which, instead of the
real-value attributes from A, we create a new set of attributes $A_C$ on the basis of the
set of cuts C. For each cut $(a, c) \in C$ we define two Boolean attributes, namely $t_{a<c}$
and $t_{a>c}$, where for any object $u \in U$ we define $t_{a<c}(u) = 1$ iff $a(u) < c$ and
$t_{a>c}(u) = 1$ iff $a(u) > c$. One can see that by applying the Boolean kernel presented
in Eq. (6) to this binary table, we obtain the kernel function shown in Eq. (8).
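
The transformation used in the proof can be traced on a small example. The sketch below binarizes two objects with respect to a set of cuts and evaluates the kernel of Eq. (6) on the resulting Boolean vectors. It assumes that no attribute value lies exactly on a cut; the function names, cuts and objects are hypothetical.

```python
def count_discerning_cuts(cuts, u, v):
    """m_ij: the number of cuts (a, c) with u and v on different sides."""
    return sum(1 for a, c in cuts if (u[a] - c) * (v[a] - c) < 0)


def binarize(cuts, u):
    """Two Boolean attributes per cut: t_{a<c}(u) and t_{a>c}(u)."""
    bits = []
    for a, c in cuts:
        bits.append(1 if u[a] < c else 0)
        bits.append(1 if u[a] > c else 0)
    return bits


def boolean_kernel(bits_u, bits_v, eps):
    """Eq. (6) on the binary table: (1 + eps)^(number of equal attributes) - 1."""
    n_equal = sum(1 for p, q in zip(bits_u, bits_v) if p == q)
    return (1.0 + eps) ** n_equal - 1.0


cuts = [("a", 1.1), ("a", 1.7), ("b", 1.5)]
u = {"a": 0.8, "b": 2.0}
v = {"a": 1.4, "b": 1.0}
m_uv = count_discerning_cuts(cuts, u, v)
n_equal_uv = sum(1 for p, q in zip(binarize(cuts, u), binarize(cuts, v)) if p == q)
k_uv = boolean_kernel(binarize(cuts, u), binarize(cuts, v), eps=0.2)
```

Each cut that discerns the two objects makes both of its Boolean attributes differ, and each other cut makes both agree. The number of agreeing attributes therefore equals 2(|C| - m_ij), so the kernel value depends only on the number of discerning cuts.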

The only problem that we should pay attention to in the implementation is
the computational complexity. Let n be the number of objects and m be the
number of attributes. Then the Boolean function shown in Eq. (7) consists of
$O(n^2)$ clauses, and each clause can consist of $O(|C|)$ variables. In the worst case, the
time complexity of constructing the kernel matrix for SVM can be $O(n^2 \cdot |C|)$.


4 Rough Sets and SVM Hybridization 


As mentioned in Sect. 2, most rough classifiers operate on a set of
decision rules, and the main difference between the methods of their construction
lies in the rule induction and rule selection steps. Among the existing
rule selection methods, one can recall, for example, the covering algorithm LEM2
[7] or the sampling technique called "dynamic reducts" [2]. In this paper we propose
another technique based on SVM with a Boolean kernel. We will show that
SVM with a Boolean kernel can be applied as a pre-processing step: the rule
induction method from Rough Sets can be restricted to the support vectors
returned by SVM.

Consider the decision table "weather" shown on the left-hand side of Table 1.
This table contains 14 objects and 4 attributes. Table 2 presents the set of all
possible decision rules for this table generated by the RSES system. For the new, unseen
object x = (sunny, mild, high, TRUE), we see that only rule nr 3 and rule nr
13 are exactly satisfied by x. Rule nr 3 has a higher strength (it is satisfied by
3 objects from the decision table) and a negative decision (we denote this fact
by -3), while rule nr 13 is satisfied by only one training object. The voting
process should therefore return the decision "no". If we also accept partial
matching, the voting process should take 10 rules into consideration. In both
cases the object x is classified into the class "no".


Table 2. The classification process of object x = (sunny, mild, high, TRUE) using
the set of decision rules generated by the RSES system


Nr | Condition => Dec | Strength | Match
1 | (outlook=overcast) => yes | 4 | 0
2 | (humidity=normal) ∧ (windy=FALSE) => yes | 4 | 0
3 | (outlook=sunny) ∧ (humidity=high) => no | -3 | 1
4 | (outlook=rainy) ∧ (windy=FALSE) => yes | 3 | 0
5 | (outlook=sunny) ∧ (temp.=hot) => no | -2 | 1/2
6 | (outlook=rainy) ∧ (windy=TRUE) => no | -2 | 1/2
7 | (outlook=sunny) ∧ (humidity=normal) => yes | 2 | 1/2
8 | (temp.=cool) ∧ (windy=FALSE) => yes | 2 | 0
9 | (temp.=mild) ∧ (humidity=normal) => yes | 2 | 1/2
10 | (temp.=hot) ∧ (windy=TRUE) => no | -1 | 1/2
11 | (outlook=sunny) ∧ (temp.=mild) ∧ (windy=FALSE) => no | -1 | 2/3
12 | (outlook=sunny) ∧ (temp.=cool) => yes | 1 | 1/2
13 | (outlook=sunny) ∧ (temp.=mild) ∧ (windy=TRUE) => yes | 1 | 1
14 | (temp.=hot) ∧ (humidity=normal) => yes | 1 | 0

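
The voting step of Eq. (5) can be replayed on the rules of Table 2. The sketch below rests on several assumptions that are not stated in the text: Match is taken to be the fraction of satisfied descriptors (which reproduces the Match column), the function converting the vote into a class is the sign of the weighted sum, and a zero sum is mapped to "no". The names `RULES`, `match` and `vote` are hypothetical.

```python
from fractions import Fraction

# (premise, signed strength): a positive strength votes "yes", a negative one votes "no".
RULES = [
    ({"outlook": "overcast"}, 4),
    ({"humidity": "normal", "windy": False}, 4),
    ({"outlook": "sunny", "humidity": "high"}, -3),
    ({"outlook": "rainy", "windy": False}, 3),
    ({"outlook": "sunny", "temp": "hot"}, -2),
    ({"outlook": "rainy", "windy": True}, -2),
    ({"outlook": "sunny", "humidity": "normal"}, 2),
    ({"temp": "cool", "windy": False}, 2),
    ({"temp": "mild", "humidity": "normal"}, 2),
    ({"temp": "hot", "windy": True}, -1),
    ({"outlook": "sunny", "temp": "mild", "windy": False}, -1),
    ({"outlook": "sunny", "temp": "cool"}, 1),
    ({"outlook": "sunny", "temp": "mild", "windy": True}, 1),
    ({"temp": "hot", "humidity": "normal"}, 1),
]


def match(x, premise):
    """Fraction of the descriptors in the premise that x satisfies."""
    satisfied = sum(1 for attr, value in premise.items() if x[attr] == value)
    return Fraction(satisfied, len(premise))


def vote(x, rules, delta):
    """Weighted vote over the rules whose match degree exceeds delta."""
    total = sum(
        strength * match(x, premise)
        for premise, strength in rules
        if match(x, premise) > delta
    )
    return total, ("yes" if total > 0 else "no")


x = {"outlook": "sunny", "temp": "mild", "humidity": "high", "windy": True}
matches = [match(x, premise) for premise, _ in RULES]
exact_total, exact_decision = vote(x, RULES, delta=Fraction(9, 10))
partial_total, partial_decision = vote(x, RULES, delta=0)
```

With exact matching only rules 3 and 13 vote and the sum is -2; with partial matching the sum is -8/3. Both are negative, which agrees with the decision "no" reported in the text.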
Table 1 also illustrates the SVM using the Boolean kernel with its parameter set to 0.2.

The original decision table consists of 10 descriptors, and Table 1 shows the
Boolean representation of the original decision table. The last column presents
the optimal α values of the objects. Recall that the α values can be
interpreted as the importance of the objects in the classification process, and
the objects with a positive α value are called the support vectors. The decision
table therefore has 8 support vectors (four objects for each decision class).

If we remove from the weather decision table the 6 objects 2, 3, 5, 7, 10 and 13,
which correspond to a zero value of the parameter α, and use the RSES system on this
restricted table, we obtain 12 of the 14 decision rules from Table 2. We lose
only two rules of quite low strength (rules nr 10 and 14). This observation
somewhat supports our hypothesis that SVM can be applied as a pre-processing
step for rough set methods to remove redundant objects.

This observation motivates us to propose the following hybrid classification
method:


SVM-based rough classifier:


1. Run the SVM algorithm with the kernel function presented in Eq. (8) to find the
support vectors.

2. Build the rough classifier for the decision table restricted to the support
vectors only.
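
The two steps above can be sketched as follows. The sketch covers only the restriction step: it assumes that the SVM has already been trained and that its dual coefficients α are available as a list. The helper name `restrict_to_support_vectors`, the tolerance and the α values in the usage lines are hypothetical. The rough classifier (for example, rule induction with RSES) would then be built on the returned sub-table.

```python
def restrict_to_support_vectors(table, alphas, tol=1e-9):
    """Keep only the objects (rows) whose SVM coefficient alpha is positive."""
    if len(table) != len(alphas):
        raise ValueError("one alpha value is required per object")
    return [row for row, alpha in zip(table, alphas) if alpha > tol]


# Usage with made-up alpha values: 14 objects, of which objects 2, 3, 5, 7, 10
# and 13 (1-based numbering, as in the weather example) have alpha = 0.
objects = list(range(1, 15))
zero_alpha = {2, 3, 5, 7, 10, 13}
alphas = [0.0 if i in zero_alpha else 0.5 for i in objects]
reduced = restrict_to_support_vectors(objects, alphas)
print(reduced)
```

In this usage example 8 of the 14 objects remain, which is the size of the restricted table discussed for the weather data.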


This classification method combines both reduction techniques, i.e. feature
reduction from rough set theory and object selection from SVM. For
larger data sets, the SVM algorithm can be replaced by SVMlight [8].

We performed some experiments on small benchmark data sets to check
the accuracy of the Boolean kernels and to compare it with that of other
rule-based classification methods. For each data set, the accuracy was calculated
by averaging over 5-fold cross validation.
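
The evaluation protocol can be illustrated with a short sketch. It is an illustration under stated assumptions rather than the code used for the experiments: `k_fold_indices`, `cross_val_accuracy` and `majority_class` are hypothetical helpers, the folds are built by dealing out a shuffled index list, the data set is assumed to have at least k objects, and any classifier can be plugged in through the `train_and_predict` callback.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Split the indices 0..n-1 into k disjoint folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, train_and_predict, k=5, seed=0):
    """Average test accuracy over k folds (assumes len(X) >= k)."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def majority_class(X_train, y_train, X_test):
    """Trivial stand-in classifier: always predict the most frequent label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)
```

A real experiment would pass a callback that trains the SVM or the hybrid classifier on the training part and returns its predictions for the test part.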

The accuracy of existing methods is given in the columns marked CN2,
RSlib and Other. These columns were taken from paper [2], where the author
presented the accuracy of the CN2 algorithm [4], of the best possible rough set based
classifier (using discretization, dynamic reducts, ...) and the best accuracy
achieved on these data by other techniques.

The accuracy of the three methods proposed by the authors, namely (1) SVM using
the Boolean kernel K1 (described in [15]), (2) SVM using the Boolean kernel K2 and
(3) the hybrid method (Boolean kernel followed by RSlib), is reported in the last 3 columns of Table 3.
During the experimental work on each data set, we checked the classification
accuracy for different values of the kernel parameter. We noticed the very interesting
fact that the best classification accuracy was achieved when the parameter was close to 1/n,
where n is the number of objects in the decision table.

Table 3 presents the accuracy of the proposed algorithms for the parameter value 1/n. It is
interesting to notice that in all cases the number of support vectors did not exceed
50% of the total number of objects.


Table 3. Accuracy of the SVM method using different Boolean kernels.


Dataset | Dimensions | CN2 | RSlib | Other | K1 | K2 | Hybrid (Boolean kernel + RSlib)
Iris | 150 x 4 | 0.96 | 0.97 | 0.97 | 0.97 | 0.95 | 0.97
Diabetes | 768 x 9 | 0.711 | 0.745 | 0.777 | 0.759 | 0.761 | 0.775
Australian | 690 x 14 | 0.796 | 0.854 | 0.87 | 0.823 | 0.817 | 0.864

One can see that the rough classifiers built from the support vectors have at
least the same accuracy as, and on two data sets better accuracy than, the rough
classifiers constructed using the whole data sets (column RSlib).


5 Conclusions 


We have shown that many basic methods in rough set theory can also be simulated
and approximately calculated by SVM with Boolean kernels. The main idea is
based on the fact that most rough set methods can be solved by the Boolean
Reasoning approach, i.e. the optimization problem in Rough Sets can be encoded
by a Boolean function. We also proposed a hybrid classification method that
combines the object selection feature of SVM with the feature selection and rule-based
classification methods of Rough Sets. The experimental results on some small
data sets show that this combination is quite promising. We plan to perform
more experiments with different heuristics for SVM, including SVMlight [8], to
make rough set methods more scalable.


Abstract. In this paper, we propose a new method, based on the Support Vector
Machine (SVM), to classify the bound error values of the Kullback-Leibler Distance
(KLD) particle filter (PF), in order to reduce the mean number of particles used
(sampling) and to improve the runtime when monitoring an object in practice. In a wireless
sensor network (WSN) system, the object location is calculated from the collected
received signal strength (RSS) variations, which are affected by furniture, walls or
reflections. We therefore propose an architecture diagram to track an object and
build the dataset model. By transforming the system state model from 1D to
2D, the bound error value of KLD resampling can enhance the estimation accuracy
and the convergence rate of the declining number of particles used, by generating a sample
set near the high-likelihood region that mitigates the effect of the RSS variations.
Our proposal considers how to classify and find the bound error values of
the KLD PF for each iteration. In the first iteration, the observation information is used
with the KLD-resampling optimal bound error to conduct a resampling on the basis of
the initial bound error. From the second iteration to the last, we use the SVM
technique to search for the predicted bound error value that minimizes the
mean number of particles used between the current and the next iteration. Our
experiments confirm that this technique can be applied in a real system.


Keywords: KLD-resampling - WSN - KNN - LDA - SVM


1 Introduction 


The basic idea of a particle filter (PF) is to use a set of weighted particles with assigned
primary weights. It is one of the methods used to improve the location estimate of an object in
space and is called a recursive Bayesian filter in [1]. The Monte Carlo method also uses a set of
weighted particles to realize the recursive Bayesian filter for nonlinear, non-Gaussian
systems. It is called a suboptimal prediction method and is applied to monitor the behavior of
an object in [2].

The estimation accuracy for the object is affected by the degenerate set of particles during
the resampling step of the PF. The resampling introduced in [3] was initially employed to combat
sample impoverishment (or degeneracy) by supplanting low-weighted particles with high-weighted
ones [4]. This helps to decrease the probability of losing the object. Several authors
investigated the effect of the choice of metric and weight function on PFs [5-9].
Firstly, the PF based on Kullback-Leibler Distance (KLD)-sampling is used to determine
the minimum number of particles needed to maintain the approximation quality in
the sampling process. With a fixed probability and a given upper bound error on the sample
set size of KLD-sampling [9], the operation time is improved. Secondly, the
predictive certainty state is used in the KLD-sampling of [7] to estimate the underlying
posterior in order to improve the micro-ability and adaptability of the particle set. The noise
variance of the new information estimation system is determined from the relationship
between the accuracy of the target prediction and the uncertainty of the system,
and it is used to determine the sampling of the proposal distribution. Thirdly, the authors in [10]
enhanced the ability to predict the particle set via the new observation information in order to
control the number of particles in double sampling. Finally, the authors in [11] applied a
trained network to KLD sampling to generate the new bin size through space division
by KD trees, which helps balance approximation error against runtime. The authors
in [5, 8, 14] proposed KLD-resampling algorithms that find the number of particles based
on the distribution of particles before and after the process reaches a pre-specified bound
error. Our research in [8, 14] proposed an enhanced KLD-resampling PF by finding the
optimal bound error. However, this optimal bound error is kept fixed during the online
training data, so how to adapt it remains an open problem.


Related Works 
To overcome this problem, we first propose an architecture diagram to collect and build the
dataset of bound error values for the online training phase of an object in WSN [8, 14, 18].
Next, our dataset model transforms the features (iterations of the predicted bound
error) to many hyperplanes in the case of increasing noise variance during two adjacent
iterations. Finally, supervised machine learning via the SVM technique is used to separate the
overlapping bound error values. It helps to classify the training bound error values of the KLD-PF
based on the minimum weight vector obtained by applying the Karush-Kuhn-Tucker (KKT) conditions.
As a result, the mean number of particles used is reduced significantly. Our experiments show that
applying supervised machine learning [13] to predict the bound error for KLD resampling based on
RSS measurements in a WSN system [8, 9, 12] improves the location estimate and the runtime
(or the number of particles used). Like the recent technique in [11], our method applies
trained data to KLD-resampling. It generates the predicted bound
error by supervised machine learning, which helps balance the approximation error
(or Root Mean Square Error, RMSE) against the runtime.

The paper is organized as follows. The system model is introduced in Sect. 2. Our
proposal is presented in Sect. 3. The Python-based monitoring results are
shown in Sect. 4. Finally, Sect. 5 concludes the paper.


2 System Model


In this paper, we consider a robot carrying a sensor node, together with some static nodes.
The robot moves along a determined path with a given velocity in a long, thin region.
The mobile node can send data to and receive data from the static nodes,
and the PF algorithm is used to find its position. All sensor nodes have the same physical
parameters, and the determined movement velocity remains the same over time. The random velocity
follows a Gaussian distribution. The system contains three models: (1) the mobility model,
(2) the RSS statistical model, and (3) the system state model.


2.1 Mobility Model 


The time is split into equal time sections as the robot moves along the path at a steady
velocity. The robot's velocity has some random noise that follows a normal distribution.
Figure 1 illustrates the determined and the random velocity. Let v′ (a random
variable which conforms to a normal distribution) and v″ denote the random velocity and
the determined velocity, respectively. As a result, the robot's velocity is represented by the
dashed line.


Fig. 1. The velocity of the robot in [12] 


2.2 RSS Statistical Model


The statistical model for RSS describes the relationship between RSS and node distance. The
popular mathematical model for WSN can be expressed as follows


$$P(d) = P(d_0) - 10\,n\,\log_{10}\!\left(\frac{d}{d_0}\right) + v_\sigma, \tag{1}$$


where $P(d)$ is the value of RSS at distance $d$, and $P(d_0)$ is the RSS at the reference
position, typically $d_0 = 1$ m. The path loss parameter related to the specific application
environment is denoted by $n$, and $v_\sigma$ is a Gaussian stochastic variable.


2.3 System State Model


The system state model for the mobile wireless sensor in [12] is defined as follows


$$x_k = x_{k-1} + V_k\,\Delta t + w_k, \tag{2}$$


$$z_k = P_{ref} + K\log(x_k) + R\,v_k, \tag{3}$$


where $x_k$ is the position of the mobile node, $\Delta t$ is the time segment, and $z_k$ is the RSS
measurement. $V_k$ is the current velocity, which combines the determined velocity and the random
velocity of the mobility model. The system state and measurement noises follow Gaussian
distributions. The reference value of RSS is $P_{ref}$ and the path loss factor is $K$.
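
To make Eqs. (1)-(3) concrete, the following sketch simulates one run of the model. It is only an illustration under stated assumptions: the logarithm in Eq. (3) is taken as base 10 by analogy with Eq. (1), the noise terms are drawn as zero-mean Gaussians with standard deviations `q_std` and `r_std`, the position is clipped to stay positive so that the logarithm is defined, and all function names and the numeric values in the usage lines are hypothetical.

```python
import math
import random


def rss_path_loss(d, p_d0, n, d0=1.0, noise_std=0.0, rng=random):
    """Eq. (1): RSS at distance d for path loss parameter n."""
    return p_d0 - 10.0 * n * math.log10(d / d0) + rng.gauss(0.0, noise_std)


def simulate(steps, x0, v_det, v_std, dt, p_ref, k, q_std, r_std, seed=0):
    """Eqs. (2)-(3): positions x_k and RSS measurements z_k of the mobile node."""
    rng = random.Random(seed)
    x, xs, zs = x0, [], []
    for _ in range(steps):
        v_k = v_det + rng.gauss(0.0, v_std)        # determined + random velocity
        x = x + v_k * dt + rng.gauss(0.0, q_std)   # Eq. (2)
        x = max(x, 1e-6)                           # keep the logarithm defined
        z = p_ref + k * math.log10(x) + rng.gauss(0.0, r_std)  # Eq. (3), log10 assumed
        xs.append(x)
        zs.append(z)
    return xs, zs


xs, zs = simulate(steps=40, x0=1.0, v_det=5.0, v_std=0.5, dt=1.0,
                  p_ref=-23.0, k=-45.0, q_std=0.5, r_std=0.5)
```

The lists `xs` and `zs` play the role of the true trajectory and of the observations that a particle filter would receive at each time step.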

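The system state model in Eqs. (2) and (3) can be rewritten as Eqs. (4) and (5) for
finding the required number of particles:


$$X_k = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} (x_{k-1} + V_k\,\Delta t) + \begin{bmatrix} 0.5 & 0 \\ 1 & 0 \\ 0 & 0.5 \\ 0 & 1 \end{bmatrix} w_k, \tag{4}$$

$$z_k = P_{ref} + K\log(\arctan(x_{1,k}, x_{3,k})) + R\,v_k. \tag{5}$$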

3 Proposal Techniques 


3.1 Architecture Diagram 


We propose an architecture diagram model to train the bound error of the KLD PF in Eq. (7)
via machine learning in WSN systems, as shown in Fig. 2.

First, our diagram collects the observation information (such as RSSI from ZigBee
or LoRa systems) based on the system state model of Eqs. (2)-(3) in [12]. Let
$\varepsilon^l$ denote the vector of bound errors, of length S, that is used to find the data of
the mean number of particles used in Eq. (7). The data found are denoted by $N^f$,
where f is the feature train.

The authors in [1] introduced the weighted particle set with assigned primary weights, which
serves as the basic idea of a particle filter (PF) for estimating the object location. This is a
suboptimal prediction method applied in the field of object tracking [2]. By non-dominated
sorting of the individuals in the population, the author in [7] proposed a
fast KLD-sampling technique for the sampling process. At each iteration of the PF, it
determines the number of samples such that, with probability 1 - δ, the error between the true
posterior and the sample-based approximation is less than a given threshold value (ε).


[Figure: block diagram with the stages Collect Data, Database, Data Processing (label: bound error ε; feature: value of each iteration; missing values), Machine Learning Algorithm, Build Candidate Model, Selected Model, Deploy Model and Prediction Value; the model is updated until processing ends.]

Fig. 2. Architecture diagram to track a target based on supervised machine learning-KLD in WSN


Here, KLD is used to determine the number of samples such that the distance between the
sample-based maximum likelihood estimate and the true posterior does not exceed a
pre-specified threshold $\varepsilon$. The KLD between the distribution p and the proposal
distribution q can be defined in discrete form as follows


$$d_{KL}(p\,\|\,q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}. \tag{6}$$

The required number $N_r$ of samples can be determined as
$N_r = \frac{1}{2\varepsilon}\,\chi^2_{k-1,\,1-\delta}$,
where k is the number of bins with support, and the quantiles of the Chi-square distribution
are given by $P(\chi^2_{k-1} \le \chi^2_{k-1,\,1-\delta}) = 1 - \delta$.
The mean particle used criterion is collected as follows


$$N_r \approx \frac{k-1}{2\varepsilon}\left[1 - \frac{2}{9(k-1)} + \sqrt{\frac{2}{9(k-1)}}\;z_{1-\delta}\right]^{3}, \tag{7}$$


where $z_{1-\delta}$ is the upper $1-\delta$ quantile of the standard normal distribution.
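
Equations (6) and (7) can be evaluated with the Python standard library alone, as the sketch below shows. It is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, q is assumed to be strictly positive wherever p is positive, the result of Eq. (7) is rounded up to an integer, and a single occupied bin (k = 1) is mapped to one particle by convention.

```python
import math
from statistics import NormalDist


def kl_divergence(p, q):
    """Eq. (6): discrete KLD between distributions p and q (q > 0 where p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kld_sample_size(k, epsilon, delta):
    """Eq. (7): number of particles for k occupied bins, bound error epsilon
    and confidence 1 - delta."""
    if k < 2:
        return 1  # convention assumed here: the bound is undefined for k = 1
    z = NormalDist().inv_cdf(1.0 - delta)  # upper 1 - delta quantile
    a = 2.0 / (9.0 * (k - 1))
    return math.ceil((k - 1) / (2.0 * epsilon) * (1.0 - a + math.sqrt(a) * z) ** 3)


# A larger bound error requires fewer particles for the same number of bins.
print(kld_sample_size(10, 0.7, 0.01), kld_sample_size(10, 0.975, 0.01))
```

The comparison in the last line reflects the trade-off exploited in the rest of the paper: raising the bound error lowers the number of particles, at the price of a coarser approximation.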


3.2 Dataset Model 


Our recent work, Ly-Tu et al. [8, 14, 18], collected the bound error values of
KLD-resampling in [8] (see Algorithm 6) in the range from 0.7 to 0.975, with the
variances R and Q in Eqs. (2) and (3) ranging from 0.1 to 0.9, for all cases of the number of particles N
(N = 100, 200, 600), to track the target in a WSN system. Based on this model, the mean
number of particles used, the bound error and the runtime are stored in one Excel file, as shown in
Fig. 3.


[Figure: class diagram. The stored items, including the mean number of particles used and the position, each hold an Iterator (number of iterations) and values (value per iteration). The parameters are N (number of particles), R (std of measurement), Q (std of process), ε (used for KLD sampling) and ε1 (used for KLD resampling).]

Fig. 3. Class diagram of our architecture


[Figure: printout of the dataset with the columns Epsilon, Iter1, Iter2, ..., Iter40 (459 rows x 41 columns). The first rows (Epsilon = 0.700) start at about 20 to 21 particles in Iter1 and settle near 7 by Iter40; the last rows (Epsilon = 0.975) start at about 14 to 15 particles in Iter1 and settle near 2 by Iter40.]

Fig. 4. Dataset of 9 classes of Epsilon in case of N = 100


Figure 4 shows an example for R = Q = 0.5 and N = 100 with 9 classes of the label (called bound
error in Fig. 2), namely Epsilon. Each class has 51 rows, so the dataset has 459 rows and 41 columns. The first
column holds the class label (Epsilon) and the next 40 columns are the
40 features. To handle missing values (the data processing step
in Fig. 2), we follow the first of the four methods in [13] and remove the missing data.

Our pseudo-code for finding the predicted bound error using the SVM method is given
in Table 1. Let us define the training set $T = \{f, \varepsilon^l\}$, in which f is the feature train (line
7, Table 1); f is a subset of $N^f$ and the size of f is S x [m : m + 1]. Let $f^{test}$ denote
the feature test (line 8, Table 1). Hence, f and $f^{test}$ are updated in each iteration.

Table 1. Finding the predicted bound error based on SVM


 1: procedure (N_max, R, Q, m, N_m^pred)
 2:   Check N_max, R, Q to get patch
 3:   Load ε^l from patch                      ▷ Label train
 4:   Load N^f from patch                      ▷ Prepare data for feature train
 5:   ΔRMSE_Pro^KLD = ΔRMSE_Pro^SIR = 0
 6:   while (S != ∅) do
 7:     f = [N^f_m, N^f_{m+1}]                 ▷ Feature train
 8:     f^test = [N_m^pred, min(N^f_{m+1})]    ▷ Feature test
 9:     idx = index(min(N^f_{m+1}))            ▷ Obtain index and remove missing data as in [13]
10:     neigh = SVMClassifier(gamma = 10)      ▷ Run SVM model
11:     neigh.fit(f, ε^l)                      ▷ Training set model
12:     ε_temp ← neigh.predict(f^test)         ▷ Predicted bound error
13:     RMSE^Pro                               ▷ RMSE of the proposal with the bound error at the m-th iteration (ε_temp)
14:     RMSE^SIR                               ▷ RMSE of SIR
15:     RMSE^KLD                               ▷ RMSE of KLD-resampling (ε = 0.65)
16:     RMSE^KLDE                              ▷ RMSE of KLDE (ε_m = 0.9 in [8])
17:     if (ΔRMSE_Pro^KLD ≥ 0 && ΔRMSE_Pro^SIR ≥ 0) then
18:       ε_m^pred = ε_temp                    ▷ Find ε_m^pred
19:       break
20:     else                                   ▷ Find new ε
21:       ε^l ← remove ε_temp                  ▷ Remove and update label train
22:       S = len(ε^l)                         ▷ Update S
23:       N^f = row.delete.N(idx)              ▷ Remove and update the mean number of particles used for the new feature train
24:     end if
25:   end while
26:   return ε_m^pred                          ▷ Find ε_m^pred
27: end procedure


The objective of our proposal is to find the bound error (Epsilon) that minimizes
the mean number of particles used, based on the KLD-resampling with adjusted bound error in [8] (see
Algorithm 6). Our work therefore introduces a bound error algorithm that applies the initial
bound error of [8] in the first iteration and the bound errors predicted by KNN,
LDA or SVM (line 10, Table 1) afterwards, as shown in Fig. 5.

To analyze the dataset in Fig. 5 further, three classes are selected, namely the first class
(0.7), the middle class (0.875) and the last class (0.975), to evaluate their overlap during
the first four iterations against the effect of the noise variance Q in Eq. (2) or the fluctuation
of the RSS measurements in Eq. (3), as shown in Fig. 6. From this analysis we can select the
candidate model to deploy in practice. If the predicted bound error does not satisfy
the RMSE performance conditions in [8], it is removed from the selected
list. The output of our architecture diagram in Fig. 2 is the predicted bound error that
fulfills both the mean number of particles used and the RMSE criteria.

Fig. 5. Predict the bound error for our dataset model

Fig. 6. The diagonal chart of analysis for three selected classes.


3.3 Support Vector Machine Technique to Classify the Bound Error Classes 


There are many different hyperplanes that classify all data points correctly, but the
best choice is the hyperplane that leaves the maximum margin to both
classes. Following the example in Fig. 6, let $Z_1$ and $Z_2$ be the margins of two candidate
hyperplanes separating the two classes of Epsilon, 0.85 (blue) and 0.975 (green), during
two adjacent iterations, as shown in Fig. 7.

Fig. 7. The margin hyperplane of two classes


Figure 7 shows the red and blue hyperplanes (drawn as the red line and the
blue line), whose margins are $Z_1$ ($Z_1 = Z_1'$) and $Z_2$ ($Z_2 = Z_2'$), respectively.
Here, the red hyperplane is chosen because $Z_1 > Z_2$. The function of this hyperplane
is determined as in [17]:


$$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0, \tag{8}$$


where $g(\mathbf{x})$ is the label of the hyperplane, with $g(\mathbf{x}) \ge 1$ for all $\mathbf{x}$ in the blue class and
$g(\mathbf{x}) \le -1$ for all $\mathbf{x}$ in the green class, and $\mathbf{w}$, $w_0$ are the weight parameters of the
hyperplane. The total margin is $\frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$. To find the
minimum weight norm $\|\mathbf{w}\|$ in this nonlinear optimization problem, the Karush-Kuhn-Tucker (KKT)
conditions are applied with Lagrange multipliers $\lambda_i$ as follows


$$\mathbf{w} = \sum_{i=0}^{N} \lambda_i y_i \mathbf{x}_i, \tag{9}$$

$$\sum_{i=0}^{N} \lambda_i y_i = 0, \tag{10}$$


where $y_i \in \{-1, +1\}$ is the class label of $\mathbf{x}_i$.

Figure 8 shows an example of finding the weight vector from
two points A(15, 6) and B(16, 6.5). The weight vector is parallel to
B - A = (16, 6.5) - (15, 6) = (1, 0.5), i.e. $\mathbf{w} = (a, 0.5a)$, so that


$$g(A) = -1 \;\Rightarrow\; 15a + 3a + w_0 = -1, \tag{11}$$


$$g(B) = 1 \;\Rightarrow\; 16a + 3.25a + w_0 = 1. \tag{12}$$


From Eqs. (11)-(12), the value of a is 1.6 and $w_0 = -29.8$. Therefore the weight
vector of the SVM is $\mathbf{w} = (1.6, 0.8)$ and $g(\mathbf{x}) = 1.6x_1 + 0.8x_2 - 29.8$. As a result, the
blue hyperplane is constructed, and the predicted value belongs to the class that
represents the bound error value of each iteration, as shown in Fig. 9.
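
The worked example with the points A and B can be checked numerically. The sketch below is only a verification aid under the same assumptions as the example (two points lying on the margins g = -1 and g = +1, and a weight vector parallel to B - A); the function names are hypothetical.

```python
import math


def hyperplane_from_margin_points(a_pt, b_pt):
    """Return (w, w0) with g(a_pt) = -1, g(b_pt) = +1 and w parallel to b_pt - a_pt."""
    d = (b_pt[0] - a_pt[0], b_pt[1] - a_pt[1])   # direction of w, here (1, 0.5)
    proj_a = d[0] * a_pt[0] + d[1] * a_pt[1]     # 15 + 3 = 18
    proj_b = d[0] * b_pt[0] + d[1] * b_pt[1]     # 16 + 3.25 = 19.25
    a = 2.0 / (proj_b - proj_a)                  # Eq. (12) minus Eq. (11)
    w0 = -1.0 - a * proj_a                       # back-substitution into Eq. (11)
    return (a * d[0], a * d[1]), w0


def g(x, w, w0):
    """Eq. (8) for a two-dimensional input."""
    return w[0] * x[0] + w[1] * x[1] + w0


A, B = (15.0, 6.0), (16.0, 6.5)
w, w0 = hyperplane_from_margin_points(A, B)
margin = 2.0 / math.hypot(w[0], w[1])            # total margin 2 / ||w||
print(w, w0, g(A, w, w0), g(B, w, w0), margin)
```

Up to floating-point rounding, this reproduces w = (1.6, 0.8) and w0 = -29.8, and the total margin 2/||w|| equals the distance between A and B, as expected when both points lie on the margins.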
Fig. 9. Hyperplane and the predicted value for a small set of database


4 Simulation Results 


The setup for tracking an object with our system in [8, 14, 18] is as follows: R = 0.5; N_max = N;
V_max = 9; V_min = 1; V_init = 5; P_ref = -23; K = -45; the number of samples is N = 100; the
variance Q takes the values 0.1, 0.3, 0.5, 0.7 and 0.9; and the time length is 40. The sample size
variation over 20 trials is shown in Tables 2 and 3.

Table 2 verifies that our SVM-KLD proposal uses the fewest particles among the compared
methods for Q between 0.1 and 0.7. This is because it can separate
the overlapping bound error values when Q increases from 0.1 to 0.7 (shown in Fig. 6) by
its margin hyperplane (shown in Fig. 7). This approach can classify the training bound
error value in Eq. (7) based on the minimum weight vector (Fig. 8) in the case of many
hyperplanes (shown in Fig. 9), so that the mean number of particles used by our method
is reduced significantly.

Table 2. Mean number of particles used


Q | KLD | KLDE | KNN-KLD | LDA-KLD | SVM-KLD
0.1 | 15.78 | 6.64 | 6.39 | 6.02 | 5.54
0.3 | 10.18 | 3.29 | 3.53 | 3.23 | 3.11
0.5 | 8.39 | 2.73 | 2.65 | 2.88 | 2.54
0.7 | 6.53 | 2.34 | 2.33 | 2.36 | 2.31
0.9 | 4.61 | 2.21 | 2.26 | 2.20 | 2.25


The RMSE and runtime of all methods are shown in Table 3.
For Q = 0.3 and 0.9, the RMSE of our SVM-KLD technique is not as good as the best
RMSE of the other supervised machine learning methods (KNN-KLD in [15] or
LDA-KLD in [16]), but it is better than that of the methods without machine learning (KLD or KLDE
in [14, 18]). This is thanks to the optimal choice of the bound error values
trained from KLDE in [14, 18]. Moreover, the runtime of our method is very good due
to the reduced number of particles used for resampling (see Table 2 for more detail). This is
useful for many applications that require fast responses in practice.


Table 3. RMSE and runtime (each entry is RMSE:runtime, with the runtime in ms)


Q | KLD | KLDE | KNN-KLD | LDA-KLD | SVM-KLD
0.1 | 15.78:3.67 | 6.64:0.75 | 6.39:0.68 | 6.02:0.63 | 5.54:0.63
0.3 | 17.18:2.19 | 14.77:0.75 | 8.40:0.80 | 7.13:0.34 | 10.55:0.30
0.5 | 18.49:1.17 | 25.08:0.39 | 12.50:0.57 | 12.34:0.39 | 12.30:0.39
0.7 | 6.53:1.08 | 2.34:0.47 | 2.33:0.31 | 2.36:0.62 | 2.31:0.24
0.9 | 23.66:1.69 | 42.18:0.78 | 19.71:0.52 | 15.89:0.63 | 16.96:0.65


5 Conclusion 


In this paper, we propose the SVM method to find the predicted bound error of KLD-resampling.
This technique reduces the effect of the fluctuations of the RSS samples in WSN, which improves the
location estimate of the object. The experiments verify that it reduces the number of particles
used compared to traditional methods.


Abstract. Our investigation aims at classifying images of the intangible
cultural heritage (ICH) in the Mekong Delta, Vietnam. We collect
an image dataset of 17 ICH categories and manually annotate it.
The comparative study of ICH image classification is done with
support vector machines (SVM) and many popular vision approaches,
including handcrafted features such as the scale-invariant feature
transform (SIFT) with the bag-of-words (BoW) model, the histogram of
oriented gradients (HOG) and GIST, and deep learning
of invariant features with networks such as VGG19, ResNet50, Inception v3 and Xception. The
numerical test results on the 17-category ICH dataset show that SVM models learned
from Inception v3 and Xception features give good accuracy of 61.54%
and 62.89%, respectively. We propose to stack SVM models using different
visual features to improve the classification result obtained by any
single one. The triplets (SVM-Xception, SVM-Inception-v3, SVM-VGG19) and
(SVM-Xception, SVM-Inception-v3, SVM-SIFT-BoW) achieve 65.32%
classification correctness.


Keywords: Images of the intangible cultural heritage in the Mekong 
Delta - Image classification - Visual features - Support vector 
machines - Stacking 


1 Introduction 


The AniAge project¹ focuses on high dimensional heterogeneous data based
animation techniques for Southeast Asian Intangible Cultural Heritage (ICH) digital
content. It aims to develop novel techniques and tools to reduce the production
costs and improve the level of automation without sacrificing the control
from the artists, in order to preserve the performing art related ICHs of Southeast
Asia. The classification of ICH images² is a work package of the AniAge
project.


¹ https://www.euh2020aniage.org.

The main aim is to automatically classify an image into one of the predefined
ICH categories. This requires collecting high quality ICH images organized by their
categories (classes/labels) and studying vision approaches to classify ICH images.
To pursue this goal, we build an image dataset of 17 ICH categories by querying
the text-based web search engine of Google, after which we manually annotate
the images. We then explore popular vision approaches to deal with the
classification task of ICH images. The extraction of visual features is performed by three
popular handcrafted methods: the scale-invariant feature transform (SIFT
[15,16]) with the bag-of-words model (BoW [2, 13, 22]), the histogram of oriented
gradients (HOG [7]) and GIST [18]. Recent pre-trained deep learning networks,
including VGG19 [21], ResNet50 [10], Inception v3 [23] and Xception [5], are also used to
extract invariant features from ICH images. Support vector machine
(SVM [24]) models are then learned from the visual features to classify ICH images. The
numerical test results on the 17-category ICH dataset show that SVM models learned from
Inception-v3 and Xception features give good accuracy of 61.54% and 62.89%,
respectively. We propose to stack SVM classifiers using different visual features to
improve the classification results given by any single one. The triplets (SVM-Xception,
SVM-Inception-v3, SVM-VGG19) and (SVM-Xception, SVM-Inception-v3,
SVM-SIFT-BoW) achieve 65.32% classification correctness.

The paper is organized as follows. Section 2 describes how to collect a dataset 
of ICH images and how to build classification models from vision approaches. 
Section 3 shows the experimental results before conclusions and future works 
presented in Sect. 4. 


2 Classification of Intangible Cultural Heritage Images 


The classification system of ICH images in Fig. 1 follows the usual framework for 
the classification of images. It involves three steps: (1) building the high quality 
dataset of images, (2) extracting visual features from images and representing 
them, and (3) training SVM classifiers. 


2.1 The Dataset of Intangible Cultural Heritage Images 


Firstly, we need to build the dataset of ICH images in the Mekong Delta,
Vietnam. Figure 2 shows sample images of the 17 ICH categories. Our proposal is
to collect ICH images from Google, since it is the biggest public
repository available. We simply run image searches with textual queries made of key words
related to the 17 ICH categories and retrieve the results. However, the results still contain
noisy and irrelevant images. We therefore perform a manual post-processing stage and tag
the images to obtain high quality images organized by their ICH categories. Table 1
presents the dataset description with a total of 7409 images.


² http://aniage.ctu.edu.vn/myproj.

Fig. 1. Framework for classifying ICH images

Fig. 2. Images of 17 ICH categories


2.2 Visual Approaches for Classifying Intangible Cultural Heritage 
Images 


Visual approaches perform the classification task of ICH images via two key
steps. The first one is to extract visual features from images and represent them.
The second one is to train SVM models to classify the images.

Table 1. Dataset description of 17 ICH categories


No | Category | #Images
1 | Đờn ca tài tử Nam Bộ | 513
2 | Nghệ thuật Chầm riêng chà pây Khmer | 185
3 | Nghề Dệt chiếu | 642
4 | Lễ hội cúng biển Mỹ Long | 398
5 | Nghệ thuật sân khấu Dù kê Khmer | 404
6 | Lễ hội Ok om bok Khmer | 465
7 | Lễ hội vía Bà Chúa Xứ Núi Sam | 405
8 | Đại lễ Kỳ yên Đình Tân Phước Tây | 223
9 | Lễ hội vía Bà Ngũ Hành | 569
10 | Lễ hội Làm chay | 365
11 | Nghề đóng xuồng ghe Long Định | 281
12 | Nghề Đan tre | 641
13 | Lễ cúng Việc lề | 447
14 | Lễ hội Đua bò Bảy Núi | 449
15 | Lễ hội Nghinh Ông | 523
16 | Lễ hội anh hùng Trương Định | 361
17 | Văn hóa chợ nổi Cái Răng | 538
Total | | 7409


Three popular methods for handcrafted features are the scale-invariant
feature transform (SIFT [15,16]) with the bag-of-words model (BoW [2, 13, 22]),
the histogram of oriented gradients (HOG [7]) and GIST [18].


Scale-Invariant Feature Transform: The SIFT descriptors [15,16] and the
bag-of-words model (BoW) are the most commonly used image representation for
image classification tasks [2,13,22]. The SIFT method detects the appearance of
the object at particular interest points, in a way that is invariant to image scale and rotation,
and robust to changes in illumination, noise, and occlusion.


Histogram of Oriented Gradients: The HOG descriptors were introduced for human
detection [7]. The HOG method computes the distribution of local intensity
gradients or edge directions to describe local object appearance and shape within
an image. The combined distributions form the image representation. The HOG
descriptor is invariant to geometric and photometric transformations, except for
object orientation.
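
As an illustration of the idea behind HOG, the sketch below builds the gradient-orientation histogram of a single cell in pure Python. It is a simplified example under assumptions made here (central-difference gradients, 9 unsigned orientation bins over 0 to 180 degrees, magnitude-weighted votes without interpolation or block normalization), not the full descriptor of [7]; the function name `cell_orientation_histogram` is hypothetical.

```python
import math


def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    `cell` is a list of rows of grayscale intensities; border pixels are skipped.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# A vertical edge: the intensity changes only along x, so all votes fall in bin 0.
edge = [[0, 0, 10, 10] for _ in range(4)]
print(cell_orientation_histogram(edge))
```

The full HOG descriptor concatenates such cell histograms over the image after normalizing them within overlapping blocks.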


GIST: The GIST descriptors proposed by [18] are used for images retrieval. 
The GIST method uses Gabor filters to extract the set of perceptual dimensions 
(naturalness, openness, roughness, expansion, ruggedness) that represent the 
spatial structure of a scene. 

Recent deep learning networks such as VGG19 [21], ResNet50 [10], Inception 
v3 [23], Xception [5] are pre-trained on ImageNet dataset [8]. These deep learning 
networks are used to extract invariant features from ICH images. 


VGG19: The VGG19 network architecture [21] consists of 19 weight layers for
large-scale image recognition. The VGG19 network uses only 3 × 3 convolutional
layers stacked on top of each other to increase depth. Max pooling layers are
used to reduce the volume size. The layers from the input layer to the last max
pooling layer are used as the feature extractor for images.


ResNet50: The ResNet50 network architecture [10] is designed with 50 weight
layers for image recognition. ResNet50 builds extremely deep networks from
micro-architecture modules (called network-in-network). Furthermore, the
network layers are trained to fit a residual mapping instead of the desired
underlying mapping. The layers from the input layer to the last pooling layer
or the last convolutional layer are used to extract image features.
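
The idea of fitting a residual mapping can be sketched as follows: a block
outputs F(x) + x, so its layers only have to learn the residual F. This is a
toy illustration with hypothetical names, in which dense matrices stand in for
the convolutional layers of ResNet50.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is a small two-layer transformation."""
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(f + x)      # the identity shortcut adds the input back

x = np.array([[1.0, 2.0, 3.0]])
zeros = np.zeros((3, 3))
# With zero weights the residual branch vanishes and the block acts as the identity.
out = residual_block(x, zeros, zeros)
```

The example shows why very deep stacks remain trainable in this design: a
block that has learned nothing passes its input through unchanged.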


Inception-V3: The “Inception” module proposed by [23] learns multi-level
features for image classification. The main idea is to use 1 × 1, 3 × 3 and
5 × 5 convolutions within the Inception module of the network. These Inception
modules are then stacked on top of each other. The volume size is reduced by
1 × 1 convolutions. The layers from the input layer to the last pooling layer
or the last convolutional layer serve as the feature extractor for images.


Xception: The “Xception” network proposed by [5] is an extension of the
Inception architecture. Xception replaces the standard depthwise separable
convolution (a depthwise convolution followed by a pointwise convolution) with
a modified one: a pointwise convolution followed by a depthwise convolution,
without any intermediate activation. Feature extraction for images is
performed by the layers from the input layer to the last pooling layer or the
last convolutional layer.
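
The saving obtained by factorizing a convolution into a depthwise and a
pointwise part can be checked by counting weights. The sketch below is a
simple illustration; the function names and the layer sizes (3 × 3 kernel,
128 input channels, 256 output channels) are assumptions chosen for the
example and are not taken from the Xception architecture.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k convolution mapping c_in to c_out channels (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # the 1 x 1 convolution mixes the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
```

The count does not depend on whether the pointwise or the depthwise
convolution is applied first.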


Support Vector Machines: For a binary classification problem, depicted in
Fig. 3, the SVM algorithm proposed by [24] tries to find the best separating
plane, i.e. the one furthest from both class +1 and class −1. To this end, SVM
training simultaneously maximizes the margin (the distance between the
supporting planes of the two classes) and minimizes the errors.

The binary SVM solver can be extended to deal with multi-class problems
(c classes, c ≥ 3). The main idea is to decompose the multi-class problem into
a series of binary SVMs, using strategies such as One-Versus-All [24] and
One-Versus-One [12]. The One-Versus-All strategy (illustrated in Fig. 4) builds
c different binary SVM models, where the i-th one separates the i-th class from
the rest. The One-Versus-One strategy (illustrated in Fig. 5) constructs
c(c − 1)/2 binary SVM models, one for each pair of the c classes. The class is
then predicted with the largest distance vote. In practice, the One-Versus-All
strategy is implemented in LIBLINEAR [9] and the One-Versus-One technique is
used in LibSVM [3].
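
The two decomposition strategies differ in the number of binary models they
require. The sketch below counts the models for the 17 ICH categories and
shows a simple vote among pairwise classifiers. The helper names and the toy
pairwise winners are hypothetical, and plain vote counting is used here as a
simplification of the prediction rule.

```python
from collections import Counter
from itertools import combinations

def n_models_one_vs_all(c):
    """One binary model per class."""
    return c

def n_models_one_vs_one(c):
    """One binary model per unordered pair of classes."""
    return c * (c - 1) // 2

def one_vs_one_predict(pairwise_winners):
    """Return the class that wins the most pairwise comparisons."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]

c = 17  # number of ICH categories in this paper
ova = n_models_one_vs_all(c)
ovo = n_models_one_vs_one(c)

# Toy 3-class example with a hypothetical winner for each pairwise classifier.
pairs = list(combinations(range(3), 2))       # (0, 1), (0, 2), (1, 2)
winners = {(0, 1): 1, (0, 2): 2, (1, 2): 1}
predicted = one_vs_one_predict(winners)
```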

SVM algorithms use different kernel functions [6] to deal with non-linear
classification tasks. Commonly used non-linear kernel functions include the
polynomial function of degree d and the radial basis function (RBF).

Fig. 3. Classification of the datapoints into two classes

Fig. 4. Multi-class SVM (One-Versus-All)

Fig. 5. Multi-class SVM (One-Versus-One)


3 Experimental Results


In this section, we present experimental results of different visual approaches
for classifying ICH images. We implement them in Python using the Keras
library [4] with the TensorFlow backend [1], the Scikit-learn library [19] and
the OpenCV library [11]. All experiments are conducted on a machine running
Linux Fedora 23, with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores), 32 GB of
main memory and an Nvidia GeForce GTX 960M GPU with 2 GB of memory.

The image dataset of 17 ICH categories in the Mekong Delta, Vietnam is
randomly split into the trainset (6001 images), the validation set (667 images)
and the testset (749 images). We use the trainset to build visual
classification models. Results are then reported on the testset using the
resulting models.


3.1 Tuning Parameters 


We use the validation set to tune the parameters of the visual classification
models built on the trainset. Among the feature extraction and image
representation methods, only the handcrafted SIFT features with the BoW model
require tuning, namely the number of clusters (visual words), which is the
parameter of the k-means algorithm [17]. We vary the number of visual words
from 1000 to 5000 to find the best experimental results. The results remain
unchanged when the number of visual words is increased beyond 2000. Therefore,
we use 2000 visual words for the BoW model.

With SVM models, we propose to use the RBF kernel function because it is
general and efficient [14]. We need to tune the hyper-parameter γ of the RBF
function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ and the cost C (a
trade-off between the margin size and the errors) to obtain the best
correctness. The best SVM parameters found for the visual classification
models are listed in Table 2.
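
The RBF kernel matrix can be computed directly from its definition, as in the
sketch below. It is an illustration rather than the experimental code: the
function name rbf_kernel and the toy 2-D points are assumptions, and the value
gamma = 0.005 is the one reported in Table 2 for the deep features.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Guard against tiny negative values caused by rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-gamma * sq_dists)

# Toy feature vectors (hypothetical).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, gamma=0.005)
```

A larger γ makes the kernel decay faster with distance, so each training point
influences a smaller neighbourhood; this is why γ and C are tuned jointly on
the validation set.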


3.2 Classification Results for 17 ICH Categories 


We obtain classification results of visual approaches in Table 3 and Fig. 6.
The highest accuracy is bold-faced and the second one is in italic. In the
comparison among visual classification approaches, we can see that handcrafted
feature extraction methods such as SIFT-BoW, HOG and GIST are not well suited
for classifying ICH images. Recent deep networks (except ResNet50) used to
extract invariant features from ICH images give the most accurate results. In
particular, Xception and Inception v3 achieve 62.89% and 61.54% overall
classification accuracy, respectively. We also tried to fine-tune these
pre-trained deep networks by re-training about 10% of their layers on our
image trainset, but the results did not improve and even degraded.


Table 2. Hyper-parameters for training SVM models


No | Feature extraction method | γ      | C
1  | SIFT and BoW              | 0.0001 | 1000000
2  | HOG                       | 0.1    | 100000
3  | GIST                      | 5      | 1000000
4  | VGG19                     | 0.005  | 100000
5  | ResNet50                  | 0.005  | 100000
6  | Inception-v3              | 0.005  | 100000
7  | Xception                  | 0.005  | 100000


Table 3. Overall classification accuracy for 17 ICH categories


No | Visual approach  | Accuracy (%)
1  | SVM-SIFT-BoW     | 33.87
2  | SVM-HOG          | 32.93
3  | SVM-GIST         | 37.25
4  | SVM-VGG19        | 50.47
5  | SVM-ResNet50     | 34.14
6  | SVM-Inception-v3 | 61.54
7  | SVM-Xception     | 62.89


3.3 Stacking of SVM Classifiers for Classifying 17 ICH Categories 


We propose to use a voting scheme [25] among visual models to improve the
classification correctness for ICH images. The main idea is to combine multiple
visual classifiers learned for the classification task by weighted voting over
the prediction of each visual classifier $VC_i$, as illustrated in Eq. (1).


$$\text{Majority-vote}\{w_1 \cdot pred(x, VC_1) + w_2 \cdot pred(x, VC_2) + \cdots + w_k \cdot pred(x, VC_k)\} \quad (1)$$

Voting schemes always include the visual classifier SVM-Xception because this
model gives the best result. Other visual models are then added to the voting
schemes in the hope that the models complement one another in the
classification. Table 4 and Fig. 7 show the results obtained by weighted voting
schemes.
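
The weighted vote of Eq. (1) can be sketched as below. The sketch assumes that
pred(x, VC) returns a vector of per-class scores; this is an assumption, since
the form of pred is not specified here (with hard labels, a classifier whose
weight exceeds 0.5 would always decide the vote alone). The function name and
the toy scores are hypothetical; the weights are those of scheme 13 in Table 4.

```python
import numpy as np

def weighted_vote(class_scores, weights):
    """Combine per-class scores of several classifiers by a weighted sum.

    class_scores: array of shape (n_classifiers, n_samples, n_classes).
    weights: one weight per classifier.
    """
    class_scores = np.asarray(class_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the classifier axis gives shape (n_samples, n_classes).
    combined = np.tensordot(weights, class_scores, axes=1)
    return combined.argmax(axis=1)

# Hypothetical scores of two classifiers for two images and three classes.
scores_a = [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]   # e.g. SVM-Xception
scores_b = [[0.1, 0.8, 0.1], [0.1, 0.3, 0.6]]   # e.g. SVM-Inception-v3
combined = weighted_vote([scores_a, scores_b], weights=[0.65, 0.35])
```

In the toy data the first classifier alone would assign the first image to
class 0, whereas the weighted combination selects class 1, which illustrates
how a second model can correct the stronger one.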

The pair SVM-Xception and SVM-Inception-v3 improves the classification
correctness by 1.62 and 2.97 percentage points over SVM-Xception and
SVM-Inception-v3, respectively.

The improvements of the triplet SVM-Xception, SVM-Inception-v3 and SVM-VGG19
over each single visual classifier are 2.43, 3.78 and 14.85 percentage points,
respectively.

The triplet SVM-Xception, SVM-Inception-v3 and SVM-SIFT-BoW achieves the same
best accuracy as the triplet SVM-Xception, SVM-Inception-v3 and SVM-VGG19. It
also improves the accuracy by 31.45 percentage points compared to
SVM-SIFT-BoW.

Fig. 6. Overall classification accuracy for 17 ICH categories


Table 4. Overall classification accuracy of voting schemes for 17 ICH categories


No | Voting scheme | Accuracy (%)
8  | 0.85*SVM-Xception + 0.15*SVM-SIFT-BoW | 63.29
9  | 0.8*SVM-Xception + 0.2*SVM-HOG | 63.02
10 | 0.8*SVM-Xception + 0.2*SVM-GIST | 63.02
11 | 0.75*SVM-Xception + 0.25*SVM-VGG19 | 63.56
12 | 0.75*SVM-Xception + 0.25*SVM-ResNet50 | 63.02
13 | 0.65*SVM-Xception + 0.35*SVM-Inception-v3 | 64.51
14 | 0.55*SVM-Xception + 0.225*SVM-Inception-v3 + 0.225*SVM-VGG19 | 65.32
15 | 0.65*SVM-Xception + 0.22*SVM-Inception-v3 + 0.13*SVM-SIFT-BoW | 65.32

Fig. 7. Overall classification accuracy of voting schemes for 17 ICH categories


4 Conclusion and Future Works


We have presented visual approaches for classifying images of the intangible
cultural heritage (ICH) in the Mekong Delta, Vietnam. We collected an image
dataset of 17 ICH categories from Google and manually tagged the images
according to their categories. Visual approaches are used to deal with the ICH
image classification. The feature extraction methods include three popular
handcrafted features (SIFT-BoW, HOG, GIST) and four recent deep learning
networks for invariant features (VGG19, ResNet50, Inception v3, Xception). SVM
models are then learned from these visual features to classify ICH images. The
numerical results on the dataset of 17 ICH categories show that SVM-Xception
and SVM-Inception-v3 give good accuracy of 62.89% and 61.54%, respectively. We
propose to use voting schemes between visual models to improve on the
classification result of any single model. The triplets (SVM-Xception,
SVM-Inception-v3, SVM-VGG19) and (SVM-Xception, SVM-Inception-v3,
SVM-SIFT-BoW) achieve 65.32% classification correctness.

These visual approaches can be used to re-rank images retrieved from Google,
after which we select the top-ranked images to automatically organize ICH
images by their ICH categories. This allows us to collect a large number of
images for a specified ICH category. Another approach [20] for enlarging the
image database combines textual and visual features.


Abstract. This paper presents an application of reconstructed Moderate-
Resolution Imaging Spectroradiometer (MODIS) Enhanced Vegetation Index
(EVI) time series to extract the water area in the lowland region of the Mekong
River. Firstly, the MODIS13A1 EVI time series with 16-day land surface
reflectance and 500 m spatial resolution is collected from 2000 to 2017,
resulting in a total of 411 images. Then, these images are used to reconstruct
the EVI time series with the Whittaker smoother method on the Google Earth
Engine. Next, the water area in each image is computed based on the smoothed
EVI value. The results show that the extracted water area in the year 2000 is
in line with the observed water elevation at Tan Chau station (for the Mekong
Delta) and at Phnom Penh station (for the Cambodian region). The correlation
coefficient between the extracted water area and the water elevation equals
0.885 for the Mekong Delta and 0.924 for the Cambodian region. The water area
extracted from MODIS13A1 EVI for the lowland region of the Mekong River can be
used to assess the inundated area resulting from different flow conditions as
well as to study inundation processes in the lowland region of the Mekong
River when using hydrodynamic models.


Keywords: MODIS images - Mekong River - EVI 


1 Introduction 


In the last decade, satellite images such as Landsat, Sentinel, and MODIS have
been widely applied to investigate and manage water resources in river systems.
This is because satellite images provide a view of the whole river system at a
large spatial scale. Previous studies [1-4] also confirmed that data collected
from satellite images at different times are extremely valuable, making
disaster management more effective. Indeed, these data can be combined with
ground surface measurements to allow for accurate representations of the field
of interest.

With regard to satellite image processing, high performance computing systems
are often applied to reduce computational time and to improve computing
capacity. In this context, Google Earth Engine (GEE) has recently been launched
and allows users in different fields to process satellite images quickly [4-6].
It should be noted that recent advances in high performance computing systems
and parallel computing platforms have allowed many useful and interesting
processes to be carried out on various topics using satellite images. Selected
examples from the last decade include generating annual maps of open surface
water bodies [6], estimating ecosystem production [3], producing crop yield and
rice paddy maps [1, 4], and mapping inundated and flooded water extent [1].

The main objective of the paper is to apply the reconstructed global MODIS EVI
time series, which is obtained from the MODIS13A1 EVI time series collected in
the period from 2000 to 2017 and processed on the GEE, to extract the water
area within the lowland region of the Mekong River. The specific objectives of
the study are to (i) investigate the spatial distribution and temporal change
of the inundated area within the Mekong Delta in Vietnam and the lowland area
in Cambodia for the whole year 2000 using the smoothed MODIS EVI time series
and (ii) examine the relationship between the extracted water area and the
observed water elevation at the hydrological stations in the study domain.


2 The Study Domain 


The Mekong River is known as one of the largest rivers in the world. The river
originates from the eastern Tibetan Plateau (China), and then flows through six
countries (i.e. China, Lao PDR, Myanmar, Thailand, Cambodia, and Vietnam)
before discharging into the East Sea of Vietnam (Fig. 1). The Mekong River
meanders over about 4,900 km and its catchment covers an area of about
795,000 km², with a mean annual river discharge at Kratie of 13,600 m³/s [7].
The weather in the river basin is characterized by the Western North Pacific
and Indian monsoons, with a wet season from June to November and a dry season
from December to May. The mean monthly temperature (at Phnom Penh station)
varies between 26 °C and 31 °C while annual evapotranspiration ranges from
1000 mm to 2000 mm, with high relative humidity. The mean annual rainfall
ranges between 1200 mm and 3000 mm, with double peaks of rainfall in most
lowland regions during wet years. Hydrological characteristics vary
significantly within the river basin, especially in the lowland region of the
river, because of the combined impacts of anthropogenic activities and natural
disturbances such as significant variation of weather, El Nino, typhoons and
tropical storms, hydropower development, climate change and sea level rise.

The lowland region of the Mekong River extends from Kratie in Cambodia to the
whole floodplains in Vietnam; the river length is about 700 km and the area is
roughly 130,000 km². In this region, the Tonle Sap Lake, located in the
Cambodian floodplain, is an important water body which contributes largely to
the flow processes in the Mekong River. The Tonle Sap Lake system consists of
the permanent lake itself, twelve tributaries, extensive adjacent floodplains
and the Tonle Sap river. The latter links the Tonle Sap Lake to the Mekong
River. The Tonle Sap river, with a length of approximately 120 km, is situated
at the Southeast end of the Tonle Sap Lake and joins the Mekong River at the
Chaktomuk confluence. At the Chaktomuk confluence, the Mekong River splits into
the Bassac river in the West and the Mekong river in the East.

Fig. 1. The lowland region of the Mekong River


Note that the lowland region of the Mekong River, home to more than 60 million
people, is the most important region in terms of agricultural and inland
fishery production in both Cambodia and Vietnam. For example, the Vietnamese
Mekong Delta is farmed intensively with paddy rice (~1.9 million ha) and
contributed about 50% (~23 million tons) of Vietnam’s rice production in 2015
[9, 10]. However, the region is under threat from a combination of climate
change, hydropower development, rising sea level and land surface subsidence,
resulting in various societal issues related to drought and flood disasters,
water resources management and environmental protection. Therefore, management
of water resources within this region under climate change and significant
variation of anthropogenic activities is critical not only for understanding
hydrological characteristics but also for developing suitable adaptation
measures in the lowland region of the Mekong River while safeguarding its
environment.


200 C. Pham Van and G. Nguyen-Van


3 Material and Method 


3.1 MODIS Imagery Data 


The MODIS images (i.e. MODIS13A1 EVI) with 16-day land surface reflectance and
500 m spatial resolution are used in this study. Although MODIS13A1 EVI is
composited from the best available pixel values, cloud and cloud shadow still
affect most satellite images [5]. Thus, MODIS13A1 EVI images from 2000 to 2017
are collected and used for time-series reconstruction.


3.2 Imagery Processing 
The general procedure of the methodology to extract the water area in the
domain of interest is shown in Fig. 2. Firstly, MODIS13A1 EVI images are
collected in the period from 2000 to 2017. Then, the Whittaker smoother method
is applied to reconstruct the EVI time series on the computational platform of
the GEE. Finally, the spatio-temporal inundated water area is computed and the
accuracy is assessed using ArcGIS.

Fig. 2. General flowchart of the methodology to extract the water area

Among different smoother methods, the Whittaker smoother method [10] is
selected and applied to reduce noise efficiently in the MODIS13A1 EVI time
series, because this method has been successfully applied to vegetation
phenology extraction, land cover classification and hyperspectral remote
sensing [5].

It should be noted that the Whittaker smoother method is based on three
quantities, i.e. the fidelity S, the roughness R, and their balance Q, where D
denotes the difference matrix. These quantities are computed as [5, 9]:

$$S = |y - z|^2 = \sum_i (y_i - z_i)^2 \quad (1)$$

$$R = |Dz|^2 = \sum_i (z_i - 2z_{i-1} + z_{i-2})^2 \quad (2)$$

$$Q = S + \lambda R \quad (3)$$

where y and z are the original and smoothed time series, respectively, and λ is
the roughness parameter.

When the weight w is taken into account, Eq. (1) is rewritten as:

$$S = \sum_i w_i (y_i - z_i)^2 = (y - z)^T W (y - z) \quad (4)$$

where W is the diagonal matrix of the weights w, with w ∈ (0, 1]. The detailed
algorithm of the Whittaker smoother method is given in Fig. 3.

Fig. 3. Algorithm of the Whittaker smoother method (modified after [5])

3.3 Water Area Extraction


In order to extract the water area in the domain of interest, we use the
Enhanced Vegetation Index (EVI) to identify and classify non-water and
full-water pixels in each image. The EVI indicator is calculated as:

$$EVI = 2.5 \times \frac{NIR - RED}{NIR + 6 \times RED - 7.5 \times BLUE + 1} \quad (5)$$

where NIR, RED and BLUE are the surface reflectance values of near infrared
Band 2 (841-876 nm), visible Band 1 (RED, 620-670 nm) and visible Band 3 (BLUE,
459-479 nm), respectively. If the EVI value of a pixel is smaller than a given
threshold ε, the pixel is set as a full-water pixel. In contrast, if the EVI
value of a pixel is larger than ε, the pixel is marked as a non-water pixel. A
threshold ε = 0.05 is used in this study, as in many previous studies
[1, 3, 10, 11].
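
Equation (5) and the thresholding rule can be sketched as follows. The
reflectance values are hypothetical, the function names are illustrative, and
the conversion to area assumes a nominal 500 m pixel; none of these are taken
from the processing chain of this study.

```python
import numpy as np

def evi(nir, red, blue):
    """Enhanced Vegetation Index as in Eq. (5)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def water_mask(evi_values, threshold=0.05):
    """True where a pixel is classified as water (EVI below the threshold)."""
    return evi_values < threshold

# Hypothetical surface reflectances: vegetation, bare soil, open water.
nir = np.array([0.45, 0.30, 0.02])
red = np.array([0.05, 0.20, 0.03])
blue = np.array([0.03, 0.10, 0.04])

values = evi(nir, red, blue)
mask = water_mask(values)
pixel_area_km2 = 0.5 * 0.5               # nominal 500 m pixel (assumption)
water_area_km2 = mask.sum() * pixel_area_km2
```

Summing the mask over all pixels of an image and multiplying by the pixel area
gives the water area for that date.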
3.4 Water Area and Water Elevation


To analyze the relationship between the extracted water area and the water
elevation, a linear regression is used in order to keep the examination of the
relationship as simple as possible. This linear regression is given by

$$y = a \times x + b \quad (6)$$

where y is the water elevation at the hydrological stations (i.e. Tan Chau in
the Mekong Delta and Phnom Penh in the Cambodian region), x is the extracted
water area, and a and b are the regression coefficients.
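
The regression coefficients of Eq. (6) and the correlation coefficient can be
obtained by ordinary least squares, as sketched below. The data values are
hypothetical and the function name is illustrative; they do not reproduce the
observations at Tan Chau or Phnom Penh.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of a and b in y = a * x + b, plus Pearson's r."""
    a, b = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return a, b, r

# Hypothetical pairs of water area (km^2) and water elevation (m).
area = np.array([1000.0, 2000.0, 3000.0, 4000.0])
elevation = np.array([1.0, 2.1, 2.9, 4.0])
a, b, r = fit_line(area, elevation)
```

A correlation coefficient close to 1 indicates that the extracted water area
follows the observed water elevation closely.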


4 Results and Discussion 


4.1 Temporal Variation of Extracted Water Area


Figure 4 shows the time series of the extracted water and non-water areas in
the year 2000 for the Mekong Delta in Vietnam and the lowland region in
Cambodia. As expected, small water areas are obtained in the dry season period
from February to June in both sub-areas of the lowland region of the Mekong
River. A large extracted water area is obtained in the wet season. The maximum
extracted water area occurs in September (see the left panel in Fig. 4) because
this period is the flood season in the lowland region of the Mekong River. On
the other hand, a large non-water area is obtained in the dry season for both
the Mekong Delta and the Cambodian region, while a small non-water area is
obtained in the wet season. These results reveal that the variation of the
extracted water and non-water areas in the domain of interest is in line with
weather characteristics.

Fig. 4. Time series of extracted water area (left panel) and non-water area (right panel)


4.2 Spatial Distribution of Extracted Water Area


The spatial distribution of the extracted water and non-water areas at
different dates in the year 2000 is shown in Fig. 5. The results clearly show
that the spatial distribution of the extracted water area is consistent with
the bed elevation. For instance, water occurs and remains in most areas where
the bed elevation is low, such as the main channel of the Mekong River, the
Tien river and the Hau river, as well as the Tonle Sap Lake in Cambodia.
Regarding the Tonle Sap Lake, it is not surprising that water remains during
the whole considered period and that the water area in this lake varies with
the date. This trend is also observed in the Long Xuyen Quadrangle and the
Plain of Reeds. Most areas in the Long Xuyen Quadrangle and the Plain of Reeds
are inundated in September and October of the year 2000. These results are
quite similar to the results obtained when using three different indices (i.e.
EVI, LSWI, and DVEL) reported in the previous study [1].

Fig. 5. Extracted water area at given dates in the domain of interest


4.3 Relationship Between Extracted Water Area and Water Elevation


Figure 6 shows the time series of the extracted water area and the water
elevation for the Mekong Delta in Vietnam and the lowland area in Cambodia.
Note that the observed water elevation at Tan Chau and Phnom Penh stations
(Fig. 1) is used to investigate the relationship between the extracted water
area and the water elevation in the Mekong Delta and the Cambodian region,
respectively. As can be seen, the higher the water elevation, the larger the
extracted water area.

Figure 7 depicts the extracted water area versus the water elevation at Tan
Chau and Phnom Penh stations. A linear regression describing the extracted
water area as a function of the water elevation is also presented. The
correlation coefficient between the water area extracted from MODIS EVI and the
observed water elevation is 0.885 for the Mekong Delta in Vietnam and 0.924 for
the lowland area in Cambodia. These results reveal that the variation of the
water area extracted from MODIS EVI within the domain of interest is in line
with the change of the observed water elevation at the hydrological stations,
namely Tan Chau and Phnom Penh.

In the lowland region of the Mekong River, the extent of the inundated area
varies with the water elevation and the date, which means that non-water, mixed
and full-water pixels inherently exist in each MODIS image. However, in order
to keep the procedure as simple as possible, we used only the EVI indicator to
identify and classify full-water and non-water pixels within each image. The
water area extracted with this procedure shows a promising relationship with
the observed water elevation at the hydrological stations (i.e. Tan Chau and
Phnom Penh), even though the mixed water pixels were disregarded. The mixed
water pixels will be investigated in future work for a more accurate
representation of the inundated area in the Mekong Delta and the lowland area
in Cambodia.

Fig. 6. Extracted water area and water elevation in the Mekong Delta (left
panel) and the Cambodian region (right panel)

Fig. 7. Water area versus water elevation for the Mekong Delta (left panel) and
the Cambodian region (right panel)


5 Conclusion 


Using the reconstructed global MODIS EVI time series, the water area in the
lowland region of the Mekong River was extracted. Firstly, the results
correlated well with weather characteristics and with the observed water
elevation at Tan Chau and Phnom Penh stations for the high-flow year 2000. The
correlation coefficient between the extracted water area and the observed water
elevation was close to unity for both the Mekong Delta in Vietnam and the
lowland area in Cambodia. Secondly, the water area extracted from the
reconstructed EVI time series for the year 2000 was also consistent with the
spatial distribution of the topography as well as the hydrological
characteristics. Finally, the results obtained in the present study are
believed to be useful for assessing the water extent caused by floods as well
as for studying inundation processes in the lowland region of the Mekong River
when using hydrodynamic models.


Abstract. Palmprint is a new biometric feature for personal identification with
a high degree of privacy and security. In this paper, we propose a palmprint
feature extraction method which combines a direction-based method (local line
directional pattern, LLDP) and a learning-based method (two-directional
two-dimensional linear discriminant analysis, (2D)²LDA) to obtain highly
discriminant direction-based features, the so-called Discriminant Local Line
Directional Representation (DLLDR). First, the algorithm computes the LLDP
features with two strategies of encoding multi-directions. Then, (2D)²LDA is
applied to extract DLLDR features with higher discriminability and lower
dimension from the LLDP matrix. The experimental results on the public
databases of Hong Kong Polytechnic University demonstrate that our method is
effective for palmprint recognition.


Keywords: Biometrics - 2DLDA - Palmprint - Discriminant local line 
directional pattern (DLLDP) 


1 Introduction 


Recently, palmprint has been increasingly studied and applied for personal
recognition because of its advantages such as high performance,
cost-effectiveness and user-friendliness [1]. Low-resolution palmprint images,
i.e. images of about 100 pixels per inch (PPI), can be used for recognition.
They contain features such as principal lines (the longest lines), wrinkles
(weaker lines), and texture [2]. Many methods exploit these features; they fall
into two categories: subspace-based approaches and local feature-based
approaches [3-7].

Subspace-based approaches (PCA, LDA, ICA, ...) project palmprint images from a
high-dimensional space to a lower-dimensional feature space. The subspace
coefficients are considered as features [8]. To overcome illumination,
contrast, and position changes, the inputs are directional images [9-11]. Rida
et al. [12] used both 2D-PCA and 2D-LDA for feature extraction and
identification.


© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 208-217, 2020.
https://doi.org/10.1007/978-3-030-38364-0_19


Palmprint Recognition Using Discriminant Local Line Directional Representation 209

Local feature representations commonly use dominant direction features and
texture because they are insensitive to illumination changes [25]. Zhang et al.
[13] used the Gabor phase in a fixed orientation to encode the palmprint, called
PalmCode, for a palmprint recognition system. Kong and Zhang [14] proposed a
dominant direction feature extraction method for palmprint recognition using
Gabor filters with different directions. Improved dominant direction based
methods include the fusion code [15] and the DRCC code [16]. Jia et al.
proposed the RLOC feature, which is computed using a line-shape filter (MFRAT
based filter) [17]. Sun et al. [18] used three grouped Gaussian filters with
three orthogonal directions in order to describe multiple lines at each pixel.
Fei et al. [20] computed a double orientation code using two dominant
orientations for palmprint recognition. The BOCV feature was proposed for
palmprint recognition, in which orientation features are encoded for all six
orientations using six Gabor filters [19]. The local line directional pattern
representation, jointly encoded by two optional directions, has also been
proposed for palmprint recognition [21]. Zheng et al. [22] proposed DoN of
palmprint to build the direction feature descriptor for recognition. Many
orientation based methods were investigated in [23]. In general, direction
features are robust and discriminant for palmprint recognition [24]. Li et al.
[29] proposed the Local Microstructure Tetra Pattern (LMTrP) for extracting
palmprint features. Moreover, modern deep convolutional neural networks have
also been studied for palmprint recognition [26-28]. Recently, approaches
combining direction features and subspace-based methods have been deemed
promising for the following reasons: (1) direction features are stable and
robust against illumination changes; (2) subspace-based methods compute global
discriminative features with low dimensions. Hoang et al. [9-11] proposed
several methods that combine global and local features using the local
directional feature and the linear discriminant analysis method. However, the
local line directional pattern is reported to be distinctive and to give higher
accuracy than the dominant directional feature [1, 21].

This paper proposes a novel Discriminant Local Line Directional Representation
(DLLDR) for palmprint recognition by combining the local line directional
pattern (LLDP) and two-directional two-dimensional linear discriminant analysis
((2D)²LDA). First, the algorithm computes the LLDP features with two strategies
of encoding multi-directions. Then, (2D)²LDA is applied to compute DLLDR
features with lower dimension and higher discriminability. The experiments on
the public databases from Hong Kong Polytechnic University demonstrate that
DLLDR is a complete and robust direction representation for palmprint
recognition.

In the following sections, we present in detail our proposed algorithm of palmprint 
recognition. Section 2 presents our proposed method. Experimental results are presented 
in Sect. 3. Finally, the paper conclusions are drawn in Sect. 4. 


2 Our Proposed Method 


The local line directional pattern (LLDP) is insensitive to illumination changes and is
more discriminative than the dominant directional code. However, this feature has two
ways to represent the potential directional code: (1) using the directions of the brightest
and darkest lines, and (2) using the directions of the darkest line and the second darkest line.
To exploit the distinctiveness of both representations, our proposed method first computes
LLDP features with these two encoding strategies. Then, we apply the (2D)²LDA method
to reduce the dimensionality of the two LLDP feature maps and eliminate the less
distinctive information. In this section, we present the proposed method in
detail, including LLDP, the (2D)²LDA method and the combination scheme.


Fig. 1. The scheme of the proposed method.


2.1 LLDP 


LLDP descriptor uses the index numbers of line directions to compute the feature code. 
There are three ways to build LLDP code [21]. 


S1: The directional bits of the minimum k line magnitudes {m_i}, (i = 0, 1, ..., K) are set
to 1 and the remaining bits are set to 0, as:

$$\mathrm{LLDP}_k = \sum_{i=0}^{K} b_i(m_i - m_k) \times 2^{i}, \qquad b_i(a) = \begin{cases} 0, & a > 0 \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

where $m_k$ is the k-th minimum magnitude and K is the number of considered directions.

S2: The indexes of the first and the second minimum line magnitudes, $t_{12}$ and $t_{11}$, are
used as:

$$\mathrm{LLDP} = t_{12} \times K^{1} + t_{11} \times K^{0} \qquad (2)$$

S3: The index numbers of the minimum line response $t_{12}$ and the maximum line response
$t_{1}$ are used as follows:

$$\mathrm{LLDP} = t_{12} \times K^{1} + t_{1} \times K^{0} \qquad (3)$$

The lines can be computed by MFRAT or by a Gabor filter bank. In an image, given
a square local area $Z_p$ of size p × p, MFRAT computes the magnitudes of different
lines {m_i}, (i = 0, 1, ..., K) at the pixel $(x_0, y_0)$ as:

$$m_i = \sum_{(x, y) \in L_i} f(x, y) \qquad (4)$$

$$L_i = \{(x, y) : y = S_i (x - x_0) + y_0,\; x \in Z_p\} \qquad (5)$$

where f(x, y) is the gray value at (x, y), $L_i$ is the set of points that form a line on $Z_p$, and
i is the index number of the slope $S_i$.

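Given an image I, a Gabor filter bank can be applied to detect the lines
{m_i}, (i = 1, 2, ..., 12), located at (x, y) as follows:

$$m_i = I * G(x, y, \theta_i, \mu, \sigma), \qquad \theta_i = \frac{\pi (i - 1)}{12}, \quad i = 1, 2, \ldots, 12,$$

$$G(x, y, \theta, \mu, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left\{-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right\} f(x, y, \theta, \mu),$$

$$f(x, y, \theta, \mu) = \exp\{2\pi j (\mu x \cos\theta + \mu y \sin\theta)\} \qquad (6)$$

where $j = \sqrt{-1}$, $\mu$ is the frequency of the sinusoidal wave, $\theta$ controls the orientation
of the function and $\sigma$ is the standard deviation of the Gaussian envelope.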

LLDP has three strategies for coding line directional patterns. However, strategy
2 can represent strategy 1 with k = 2, so we choose only two strategies, strategy 2 and
strategy 3, to obtain the candidate patterns and fully exploit the distinctiveness of the
palm lines. With these two strategies, LLDP is built from the darkest, second
darkest and least dark lines, which are stable and clear lines and which affect the accuracy
of recognition.
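
As an illustration of Eqs. (2) and (3), the following minimal Python sketch encodes one pixel from its K line magnitudes. It assumes the magnitudes have already been computed (by MFRAT or a Gabor bank) and that direction indexes start at 0; the function name `lldp_codes` is hypothetical and not part of the original implementation.

```python
import numpy as np

def lldp_codes(magnitudes):
    """Return the (strategy 2, strategy 3) LLDP codes for one pixel.

    `magnitudes` holds the K line magnitudes m_i of the pixel.
    Assumption: direction indexes are 0-based.
    """
    m = np.asarray(magnitudes, dtype=float)
    K = m.size
    order = np.argsort(m, kind="stable")   # ascending: darkest line first
    t_min, t_second, t_max = order[0], order[1], order[-1]
    s2 = int(t_min * K + t_second)         # Eq. (2): minimum and second minimum
    s3 = int(t_min * K + t_max)            # Eq. (3): minimum and maximum
    return s2, s3

# Four directions with magnitudes 5, 1, 9, 3
codes = lldp_codes([5, 1, 9, 3])
```

Applying the function to every pixel of the ROI would yield the two LLDP maps used in the rest of the method.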


2.2 (2D)²LDA


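(2D)²LDA is applied to reduce the dimension of the LLDP matrix. Suppose {A_k}, k =
1...N are the LLDP matrices computed by formula (2) with strategy S2 (or by formula (3) with
strategy S3), which belong to C classes, and the i-th class $C_i$ has $n_i$ templates
($\sum_{i=1}^{C} n_i = N$). Let $\bar{A}$ be the mean of the registration set, and $\bar{A}_i$ the mean
of the i-th class:

$$A_k = \left[(A_k^{(1)})^T, (A_k^{(2)})^T, \ldots, (A_k^{(m)})^T\right]^T, \quad \bar{A}_i = \left[(\bar{A}_i^{(1)})^T, (\bar{A}_i^{(2)})^T, \ldots, (\bar{A}_i^{(m)})^T\right]^T, \quad \bar{A} = \left[(\bar{A}^{(1)})^T, (\bar{A}^{(2)})^T, \ldots, (\bar{A}^{(m)})^T\right]^T \qquad (7)$$

where $A_k^{(j)}$, $\bar{A}_i^{(j)}$ and $\bar{A}^{(j)}$ are the j-th row vectors of $A_k$, $\bar{A}_i$ and $\bar{A}$, respectively. 2DLDA
computes a set of optimal vectors to find the optimal projection matrix: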


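$$X = \{x_1, x_2, \ldots, x_d\} \qquad (8)$$

by maximizing the criterion as follows:

$$J(X) = \frac{X^T G_b X}{X^T G_w X} \qquad (9)$$

$$G_b = \frac{1}{N} \sum_{i=1}^{C} n_i \sum_{j=1}^{m} \left(\bar{A}_i^{(j)} - \bar{A}^{(j)}\right)^T \left(\bar{A}_i^{(j)} - \bar{A}^{(j)}\right) \qquad (10)$$

$$G_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{k \in C_i} \sum_{j=1}^{m} \left(A_k^{(j)} - \bar{A}_i^{(j)}\right)^T \left(A_k^{(j)} - \bar{A}_i^{(j)}\right) \qquad (11)$$

where T denotes the matrix transpose, $G_b$ is the between-class scatter matrix and $G_w$ is the
within-class scatter matrix. Therefore, X consists of the orthonormal eigenvectors of $G_w^{-1} G_b$
corresponding to the d largest eigenvalues $\lambda_1, \ldots, \lambda_d$. The value of d is selected
based on a predefined threshold $\theta$ as follows:

$$\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \theta \qquad (12)$$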


The 2DLDA described above works in the row-wise direction to learn an optimal matrix
X from a set of training LLDP matrices, and then projects an LLDP matrix $A_{m \times n}$ onto
X, yielding an m by d matrix, i.e. $Y_{m \times d} = A_{m \times n} \cdot X_{n \times d}$. Similarly, the alternative 2DLDA
learns an optimal projection matrix Z reflecting information between the columns of the LLDP
matrices and then projects A onto Z, yielding a q by n matrix, i.e. $B_{q \times n} = Z^T_{m \times q} \cdot A_{m \times n}$.
Suppose we have obtained the projection matrices X and Z; projecting the LLDP matrix
$A_{m \times n}$ onto X and Z simultaneously yields a q by d matrix D:

$$D = Z^T \cdot A \cdot X \qquad (13)$$
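
The projection in Eq. (13) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' C# implementation: the helper names (`scatter_matrices`, `top_eigvecs`, `fit_2d2lda`, `project`) are hypothetical, the scatter matrices use the usual 1/N normalization, the pseudo-inverse replaces $G_w^{-1}$ for numerical safety, and the eigenvectors are not explicitly re-orthonormalized.

```python
import numpy as np

def scatter_matrices(mats, labels):
    """Row-direction between/within-class scatter (n x n) for m x n matrices."""
    mats = np.asarray(mats, dtype=float)
    labels = np.asarray(labels)
    N, _, n = mats.shape
    mean_all = mats.mean(axis=0)
    Gb = np.zeros((n, n))
    Gw = np.zeros((n, n))
    for c in np.unique(labels):
        cls = mats[labels == c]
        mean_c = cls.mean(axis=0)
        diff = mean_c - mean_all
        Gb += len(cls) * diff.T @ diff
        for A in cls:
            Gw += (A - mean_c).T @ (A - mean_c)
    return Gb / N, Gw / N

def top_eigvecs(Gb, Gw, theta=0.95):
    """Leading eigenvectors of pinv(Gw) @ Gb; d chosen by the energy ratio theta."""
    vals, vecs = np.linalg.eig(np.linalg.pinv(Gw) @ Gb)
    vals, vecs = vals.real, vecs.real
    order = np.argsort(vals)[::-1]
    vals, vecs = np.clip(vals[order], 0.0, None), vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    d = min(int(np.searchsorted(ratio, theta)) + 1, vals.size)
    return vecs[:, :d]

def fit_2d2lda(mats, labels, theta=0.95):
    """Learn X (n x d, row direction) and Z (m x q, column direction)."""
    mats = np.asarray(mats, dtype=float)
    X = top_eigvecs(*scatter_matrices(mats, labels), theta)
    Z = top_eigvecs(*scatter_matrices(mats.transpose(0, 2, 1), labels), theta)
    return X, Z

def project(A, X, Z):
    """Eq. (13): D = Z^T A X, a q x d matrix."""
    return Z.T @ A @ X
```

The column-direction matrix Z is obtained here by applying the same routine to the transposed training matrices, which is one common way to realize the "alternative 2DLDA" described above.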


Fig. 2. Results of LLDP with strategy 2 and (2D)²LDA: (a) original palmprint image, (b) LLDP
image, (c1)-(c5), (d1)-(d5), (e1)-(e5) some reconstructed images of the LLDP image with (c1)-(c5) d = 10,
15, 20, 25, 50 and q = 64, (d1)-(d5) d = 64, q = 10, 15, 20, 25, 50, (e1)-(e5) q = d = 10, 15, 20,
25, 50, respectively.


The matrix D is also called the discriminant local line directional representation 
matrix (DLLDR) for recognition. 


2.3 Discriminant Local Line Directional Representation


The input of our proposed algorithm is a palmprint ROI image I. Figure 1 illustrates
the proposed method. The processing steps for extracting the DLLDR feature
are summarized as follows:

- Step 1: Compute the LLDP with strategy 2 to get the $A^1$ matrix by using formula (2).
- Step 2: Compute the LLDP with strategy 3 to get the $A^2$ matrix by using formula (3).
- Step 3: Based on (2D)²LDA, compute the DLLDR feature $D^1$ by applying Eq. (13)
  to the $A^1$ feature matrix.
- Step 4: Based on (2D)²LDA, compute the DLLDR feature $D^2$ by applying Eq. (13)
  to the $A^2$ feature matrix.
- Step 5: The combined feature matrix $\{D^1, D^2\}$ is the DLLDR of the input image I.


Fig. 3. Results of LLDP with strategy 3 and (2D)²LDA: (a) original palmprint image, (b) LLDP
image, (c1)-(c5), (d1)-(d5), (e1)-(e5) some reconstructed images of the LLDP image with (c1)-(c5) d = 10,
15, 20, 25, 50 and q = 64, (d1)-(d5) d = 64, q = 10, 15, 20, 25, 50, (e1)-(e5) q = d = 10, 15, 20,
25, 50, respectively.


Given a query image I, we apply the proposed method to get the DLLDR feature
$D: \{D^1, D^2\}$, and apply our method to all the training images to get the DLLDR feature
matrices $D_k$ (k = 1, 2, ..., N). The Euclidean distance is used to compare two features.
The distance between D and $D_k$ is defined by:

$$d(D, D_k) = \|D - D_k\|_2, \qquad \mathrm{score}(D, D_k) = 1 - d(D, D_k) \qquad (14)$$

The distance $d(D, D_k)$ is between 0 and 1. The score of a perfect match is 1.


3 Experimental Results 


We evaluate the proposed method in comparison with some methods (LLDP [21], RDORIC
[10]) on the PolyU 3D database of Hong Kong Polytechnic University [30]. These
methods were implemented in C# on a PC with an Intel(R) Core(TM) i3-3110M CPU
@ 2.4 GHz and Windows 7 Professional. In the PolyU 3D database, there are 400
different palms. Twenty images of each palm were captured in each session.
The time interval between the two sessions is about 30 days. Each sample contains a
3D ROI (region of interest) and a 2D ROI at a resolution of 128 x 128 pixels. In our
experiments, we use the 2D-ROI database, in which the resolution of the ROI images is
64 x 64 pixels. In identification, a query is compared with all templates in the training set to
select the most similar template as the result. In verification, each image in the query set
is compared with all templates in the registered set to generate incorrect scores and
correct scores. The correct score is the maximum of the scores between the query and the
templates from the same registered palm. Similarly, the incorrect score is the maximum
of the scores between the query and all templates of the different registered palms. If
the query does not have any registered images, we obtain only the incorrect score. If we
have N queries of registered palms and M queries of unregistered palms, we obtain N
correct scores and N + M incorrect scores. The verification results are reported as the receiver
operating characteristic (ROC) curve. To mirror the number of employees in small
and medium-sized companies, we set up two experiments with dataset 1 and dataset
2, with N = 100 and 200. In dataset 1, the training database contains 500 templates
from 100 random different palms, where each palm has five templates. The testing
database contains 1000 templates from 200 different registered palms. In dataset 2,
the training database contains 1000 templates from 200 registered palms. The testing
database contains 1000 templates from 200 registered palms. Therefore, there are 500,
1000 correct identification distances and 1000, 2000 incorrect identification scores for
N = 100, 200, respectively. None of the samples in the testing datasets is contained in any
of the training datasets. Table 1 presents the parameters of our experiments. Table 2
presents the recognition accuracy of our method in comparison with the others. The ROC
curves illustrating the verification performance of our method and the others are shown in
Fig. 4. From these figures and tables, we can see that the recognition accuracy
of our method is higher than that of the state-of-the-art methods (RDORIC [11], LLDP [21]).
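
The way correct and incorrect scores are collected in verification can be sketched as follows. This is a minimal Python illustration, assuming a precomputed score matrix (one row per query, one column per registered template); the function name `verification_scores` and the toy identifiers are hypothetical.

```python
import numpy as np

def verification_scores(scores, query_ids, template_ids):
    """Collect correct and incorrect scores as described in the text.

    scores[i, j] is the matching score between query i and template j.
    A query whose palm is not registered contributes only an incorrect score.
    """
    scores = np.asarray(scores, dtype=float)
    template_ids = np.asarray(template_ids)
    correct, incorrect = [], []
    for row, qid in zip(scores, query_ids):
        same = template_ids == qid
        if same.any():                       # registered palm: best same-palm score
            correct.append(float(row[same].max()))
        if (~same).any():                    # best score against all other palms
            incorrect.append(float(row[~same].max()))
    return correct, incorrect

# One registered query ("a") and one unregistered query ("z")
correct, incorrect = verification_scores(
    [[0.9, 0.8, 0.3], [0.2, 0.4, 0.6]],
    query_ids=["a", "z"],
    template_ids=["a", "a", "b"],
)
```

With N = 1 registered query and M = 1 unregistered query, the sketch returns one correct score and two incorrect scores, matching the N and N + M counts given above.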

Table 1. Parameters of databases in recognition experiments.

Dataset | Each class: training set | Each class: testing, registration set | Each class: testing, unregistration set | All classes: training set | All classes: testing set | Correct distances | Incorrect distances
1       | 5 | 5 | 5 | 500  | 500 + 500 = 1000   | 500  | 500 + 500 = 1000
2       | 5 | 3 | 5 | 1000 | 1000 + 1000 = 2000 | 1000 | 1000 + 1000 = 2000

Table 2. Rank one identification in our experiments.

Method             | Dataset 1: recognition rate (%) | Dataset 1: test time for a query image (ms) | Dataset 2: recognition rate (%) | Dataset 2: test time for a query image (ms)
RDORIC [10]        | 97.80 | 147 | 97.67 | 204
LLDP [21]          | 98.80 | 352 | 98.70 | 526
Our method (DLLDR) | 99.60 | 153 | 99.30 | 275

[Figure: two ROC plots, (a) and (b), each comparing LLDP [21], RDORIC [11] and our method; horizontal axis: False Accepted Rate (%).]

Fig. 4. The ROC curves of our proposed method and other methods with dataset 1 (a) and dataset 2
(b), respectively.


4 Conclusion 


This paper proposes a novel technique called Discriminant Local Line Directional
Representation for palmprint recognition, which combines LLDP and (2D)²LDA. First,
the algorithm computes the LLDP features with two strategies for encoding multiple
directions. Then, (2D)²LDA is applied to extract the DLLDR features with lower
dimension and higher discriminative power. Experimental results on two palmprint datasets of the 3D
PolyU database show that the proposed method achieves the best results in comparison
with the state-of-the-art methods.


Abstract. Identifying similarities between audio signals is an important goal for
speech recognition systems. This paper proposes a new method for assessing
the similarity between two audio signals. Based on entropy theory, the
audio signals are segmented and compared on the basis of the information value
carried by each segment: this value is obtained by considering an alphabet derived
from the trend lines of each audio signal. It is an experimental method that makes it
possible to represent and analyse the results through an isometric diagram, for easier
interpretation.


Keywords: Entropy - Information theory - Sound similarity - Speech assessment 


1 Introduction 


Voice recognition systems are now part of everyday life, to the point that they can be
considered a constant presence in any activity.

Speech recognition is a “young” technology that is still being developed and
improved. The challenges it presents are numerous, but they are gradually
being reduced thanks to extensive research in this field. Listening to and understanding
what a person says is much more than hearing the words the person speaks.
Each person is different and pronounces the same words differently,
depending on many factors such as health or emotional state. Machines are learning to
“listen” to accents, emotions and inflections [1, 2], but there is still work to be done in
this area and improvements still need to be made.

Different methods are used for speech recognition, including:

• the Hidden Markov Model method [3-6], in which the observation is linked to the state
  through a probabilistic function;
• the Neural Network method [7-10], which uses systems consisting of interconnected
  computational nodes working somewhat similarly to human neurons.


Research has invested, and is still investing, in this field in order to improve voice
assistants and guarantee better services to people. The everyday situations in which a
person uses a voice assistant are many and varied, and it is normal that in
such situations the person, who may be engaged in other activities (for example,
driving or, more generally, having their hands busy), pronounces words incorrectly or too
hastily. It may also happen that, in particularly noisy environments, the
recorded audio has imperfections and discontinuities, or that the person
who uses the speech recognition system has speech or language defects. The main task
of a speech recognition system is not limited to simply detecting what has been
pronounced: to guarantee optimal functioning, it carries out a series of analyses
of the voice, the speech and all the recorded audio in order to better interpret
and, where necessary, correct what was said. One of the essential concepts to be considered
in speech recognition is the concept of similarity: only in this way is it possible to go
beyond the problems mentioned above.

This paper illustrates a method to analyse and evaluate the similarity between two
voice patterns, considering them not as separate entities but as belonging to the same
pattern. The method is based on calculating the information content carried by each
voice pattern so as to assess the similarity between two sound waves and, consequently,
to detect phoneme mispronunciations.

This paper is organized as follows. A brief introduction to the concept of similarity is
presented in Sect. 2. Section 3 explains the method used to assess the similarity, together
with the underlying information theory (as well as the properties related to quantifying
information). In Sect. 4, the effectiveness of the method is illustrated through some
experimental results. Finally, Sect. 5 concludes this paper with a brief discussion.


2 What Is Similarity? 


Every day, a great deal of information reaches each person, but what is the use of reading,
listening to or watching the media if we do not systematically analyse
the news? How can we understand and select from this flow of information?

Analysis becomes a tool that provides a defence against such a vast flow of information:
it makes it possible to filter the information [11]. Information takes on meaning through
comparison with something a person already knows. Comparison is an intuitive,
almost automatic operation for a person: for example, while listening to a piece of
music, a person continually compares what they are hearing with
what they have already heard. Analysis can define the structural elements of a
message only through a comparison operation.

It is in this context that the concept of similarity assumes importance.

This concept has been studied from different perspectives, such as the cosine coefficient
[12], information content [13-15], mutual information [16-18] and distance-based
measurements [19-21]. A common feature of all these methods is that each of them is related
to a particular context or assumes a particular domain model.

The difference between the concept of similarity and the concept of identity is very
subtle: generally, two entities (A and B) can be defined as similar if they have some
features in common, but not necessarily all [11]. Therefore, the similarity between A
and B depends on:

• their common characteristics (the more they have, the more similar they are),
• their differences (the more they have, the less similar they are).

On the basis of these considerations, it emerges that to compare two entities (A and B)
it is necessary to segment these entities in order to identify common
characteristics or dissimilarities (Fig. 1).


[Figure: two overlapping sets, A (su-pe-ra-re) and B (su-pe-rio-re), showing the elements unique to A, the elements unique to B and the common elements.]

Fig. 1. Representation of two entities that each contains its own unique features and also contains
common features.


3 The Method 


The method presented in this paper aims to assess the similarity between two speech
audio messages (finding common characteristics or dissimilarities) using the information
content method. In the following examples, two audio files have been considered, with
the pronunciation of the word “SUPERARE” (the Italian word for “overtake”) by two
different people: a 6-year-old child and a 10-year-old child.

To compare the audio messages, each audio message is segmented into
syllables (using Kaldi, an open-source toolkit [22]) and then the entropy of each syllable
is calculated. This calculation necessarily implies a reference to a specific alphabet.
The alphabet is a set of symbols that characterize a language [23] and are associated
with it [24-27]: their different combinations make it possible to construct a message. Thus,
in the transmission of a message the meaning changes according to the probability that
some symbols are transmitted, as may be inferred immediately from formula (1).


$$H(X) = E[I(x_i)] = \sum_{i=1}^{n} I(x_i) \cdot P(x_i) = \sum_{i=1}^{n} P(x_i) \cdot \log_2 \frac{1}{P(x_i)} \qquad (1)$$
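
As a worked illustration of formula (1), the following minimal Python sketch estimates symbol probabilities from relative frequencies and returns both the per-symbol terms and their sum. The function names and the toy symbol string are assumptions made for illustration; the paper does not specify how probabilities are estimated.

```python
import math
from collections import Counter

def entropy_terms(symbols):
    """Per-symbol contributions P(x) * log2(1 / P(x)) from relative frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: (c / total) * math.log2(total / c) for s, c in counts.items()}

def entropy(symbols):
    """Shannon entropy H(X) in bits, i.e. the sum of the per-symbol terms."""
    return sum(entropy_terms(symbols).values())

# Toy sequence over the alphabet {A, D, S}: P(A) = P(D) = 0.4, P(S) = 0.2
terms = entropy_terms("AADDS")
h = entropy("AADDS")
```

For a symbol with probability 0.4 the term is 0.4 · log2(1/0.4) ≈ 0.5288, and for probability 0.2 it is ≈ 0.4644; terms of this form appear to coincide numerically with several entries of Table 3, which suggests a base-2 logarithm.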


The following procedure is used to obtain the alphabet from each syllable of the 
audio message: 

(1) extraction (from a Waveform Audio File) of the values of the individual samples (in
    CSV format),
(2) elimination of negative values,
(3) representation of the remaining positive values as a line chart,
(4) visualization of the 6th-order polynomial trend line for the positive values (Fig. 2),

[Figure: trend line plot, panel labelled RE.]

Fig. 2. 6th-order polynomial trend line.

(5) compilation of the alphabet table (Table 1), considering the trend of the line (A =
    Ascending, D = Descending, S = Stable if the curve variation is less than 10%)
    and the height difference between two consecutive points; the height of each point
    is defined proportionally with respect to the maximum value 10.
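
The steps above can be sketched in Python with NumPy as follows. This is only an illustration under several assumptions that the text leaves open: the trend line is sampled at a fixed number of points, the 10% stability threshold is measured relative to the maximum of the trend line, and heights are rounded to integers from 0 to 10. The function name `trend_symbols` is hypothetical.

```python
import numpy as np

def trend_symbols(samples, n_points=20, stable_tol=0.10):
    """Turn one syllable's samples into a list of trend symbols.

    Returns "S" for a stable step, otherwise (height, "A") or (height, "D"),
    where height is the starting point's height scaled to 0..10.
    """
    samples = np.asarray(samples, dtype=float)
    positive = samples[samples > 0]                 # step (2): drop negative values
    t = np.linspace(0.0, 1.0, positive.size)
    coeffs = np.polyfit(t, positive, deg=6)         # step (4): 6th-order trend line
    grid = np.linspace(0.0, 1.0, n_points)
    curve = np.clip(np.polyval(coeffs, grid), 0.0, None)
    peak = curve.max()
    heights = np.rint(10 * curve / peak).astype(int)
    symbols = []
    for a, b, h in zip(curve[:-1], curve[1:], heights[:-1]):
        change = (b - a) / peak                     # variation relative to the peak
        if abs(change) < stable_tol:
            symbols.append("S")
        else:
            symbols.append((int(h), "A" if change > 0 else "D"))
    return symbols

# A steadily rising signal yields only ascending symbols
rising = trend_symbols(np.linspace(0.01, 1.0, 200), n_points=5)
```

Counting how often each symbol occurs in the returned list gives the "Number" column of the alphabet table.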


Table 1. Example of Alphabet Table: the first column indicates the height of the point, the second
column indicates the trend of the graph line and the third column shows the number of trends of
this specific type identified in the audio message (considering the trend line).

Height | Trend | Number
0      | A     |
       | D     |
1      | A     |
       | D     |
2      | A     |
       | D     |
S      |       |


After defining the alphabet, a transition matrix is created (Table 2) in order to calculate
the entropy of each syllable: it is necessary to take into account the manner in which
the trends of the line chart (between two points) succeed one another within the audio
message. To this end, a Markov stochastic process is used: a continuous
sequence of states of a process in which the probability of passing from one state
to another in a unit of time depends only on the immediately
preceding state and not on the overall “history” of the system.
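
Counting how symbols follow one another can be sketched as below. This is a minimal Python illustration assuming that each trend is represented by a string label such as "0A" or "S"; the function names are hypothetical, and the row normalization into probabilities is the standard first-order Markov estimate rather than a detail confirmed by the paper.

```python
from collections import defaultdict

def transition_counts(sequence):
    """Count how often each symbol is immediately followed by each other symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def transition_probabilities(counts):
    """Normalize each row of the count table into transition probabilities."""
    probs = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: c / total for nxt, c in row.items()}
    return probs

seq = ["0A", "1A", "S", "1A", "S"]
counts = transition_counts(seq)
probs = transition_probabilities(counts)
```

Each row of `probs` is a probability distribution over the next symbol, so an entropy in the sense of formula (1) can be computed per row.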


Table 2. Example of transition matrix for the word represented in Fig. 2.

Columns: heights 0 to 10, each split into A and D, followed by S.

Row  | Entries (left to right; empty cells omitted)
0 A  | 1
0 D  | 1 1 2
1 A  | 2
1 D  | 1 4
2 A  | 1
2 D  | 1
3 A  | 1 1
3 D  | 1 1 1 2
4 A  | 1 2 1
4 D  |
5 A  | 1
5 D  |
6 A  | 2
6 D  |
7 A  |
7 D  |
8 A  | 1
8 D  |
9 A  |
9 D  |
10 A |
10 D |
S    | 1 1 1 1 1 1 2 3


A single transition matrix is created for both speech recordings.

By means of the alphabet table and the transition matrix it is possible to calculate the 
informative value of each syllable and define the information slots, each representing a 
specific range of values. 

Figure 3 shows an example of the analysis of two audio messages and the respective
representation with the isometric diagram. The upper part of the diagram represents
(through colours) the information value of each syllable of the first audio message,
and the lower part represents the information value of each syllable of the second
audio message. The colour representing a specific information value has a darker
or lighter shade depending on whether it is a greater or lesser value within one of the
information ranges. If the two syllables compared have the
same information value, the diagram shows a column with a single colour,
corresponding to one of the information intervals previously defined (see Table 3). By
contrast, if the information values of the two syllables are different, the diagram
shows a column whose colour changes, passing from the colour of the first
syllable to the colour of the second syllable.


Table 3. Information value of the syllables.

Syllable | Model       | Sample
SU       | 0,216096405 | 0,216096405
         | 0,528320834 | 0,528320834
         | 0           | 0
         | 0,528320834 | 0,507650812
         | 0,5         | 0,5
PE       | 0,216096405 | 0,216096405
         | 0,528771238 | 0,528771238
         | 0,430827083 | 0,430827083
         | 0           | 0,528320834
         | 0,464385619 | 0,5
RA       | 0,133048202 | 0,410544839
         | 0,464385619 | 0,5
         | 0,298746875 | 0
         | 0,5         | 0,430827083
         | 0,5         | 0,464385619
RE       | 0,375       | 0,375
         | 0,528771238 | 0,528771238
         | 0,464385619 | 0,430827083
         | 0,430827083 | 0,5
         | 0,5         | 0,5

Range 1: 0 - 0,2    Range 2: 0,2 - 0,4    Range 3: 0,4 - 0,6
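
The comparison that drives the isometric diagram can be sketched as follows: each information value is assigned to one of the three ranges, and the model and sample values are compared entry by entry. This is a minimal Python illustration; the handling of range boundaries (lower bound inclusive) and the reading of the first five rows of Table 3 as belonging to the syllable SU are assumptions.

```python
RANGES = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6)]   # Range 1, Range 2, Range 3

def info_range(value):
    """Return the 1-based index of the information range containing value."""
    for idx, (low, high) in enumerate(RANGES, start=1):
        is_last = idx == len(RANGES)
        if low <= value < high or (is_last and value == high):
            return idx
    raise ValueError(f"value {value} is outside the defined ranges")

def compare_syllable(model, sample):
    """For each pair of values: equal or not, plus the range of each value."""
    return [
        ("same" if m == s else "different", info_range(m), info_range(s))
        for m, s in zip(model, sample)
    ]

# Values read from Table 3 for the syllable SU (decimal commas written as points)
su_model = [0.216096405, 0.528320834, 0.0, 0.528320834, 0.5]
su_sample = [0.216096405, 0.528320834, 0.0, 0.507650812, 0.5]
su_result = compare_syllable(su_model, su_sample)
```

In this example only the fourth pair differs, and both of its values still fall in Range 3, which corresponds to a column whose colour changes only slightly in shade.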

[Figure: isometric diagram with one column per syllable: SU, PE, RA, RE.]

Fig. 3. Isometric analysis of two audio messages.


4 Experimental Results 


The analysis method described in this article was tested using a specially developed
algorithm, which has no parameters to set for the analysis.
The speech audio was recorded with the following technical characteristics:

• audio: mono signal;
• sample rate: at least 44.1 kHz (for a more accurate description of the audio
  signal);
• audio file format: WAVE or AIFF (uncompressed audio formats).


Before starting the tests, a list of 30 specific words was created for testing:
words with high recognition error rates and composed of at least three syllables.


Table 4. Excerpt of the words used for testing.

Word           | Number of syllables | Language
Ag-nos-tic     | 3 | English
En-vi-ron-ment | 4 | English
Ham-bur-ger    | 3 | English / Italian
Ol-tre-ma-re   | 4 | Italian (word for “overseas”)
Ir-reg-u-lar   | 4 | English
Scen-de-re     | 3 | Italian (word for “descend”)
Sig-na-ture    | 3 | English
Sti-ra-re      | 3 | Italian (word for “iron”)
Stre-go-ne     | 3 | Italian (word for “shaman”)

The words were chosen from Italian and English (15 Italian words and 15 English
words), considering words that require a different articulation of the vocal organs (see
Table 4).

The tests involved 2 different speakers, one for the British pronunciation and one for the
Italian pronunciation, and 8 people aged between 12 and 20.

First of all, the words spoken by the speakers were recorded; then, participants were
tested one at a time. They sat in front of a microphone and listened to the auditory stimuli
(the recordings made with the two speakers) on headphones:
after listening to a word three times, they had to repeat the same word.

The analysis of the recordings provided important information:

(1) the difference in intensity in the pronunciation of the same word by two people
    is not an important discriminant: as we can see in the example of Fig. 3, the first
    syllable “SU” has two different intensities in the two recordings (see the trend
    between points 3 and 4), and in the corresponding isometric diagram the columns
    show only small, barely evident inflections of colour;

(2) words that begin with “h” show a high degree of discrimination: the “h” sound is
    treated like the noise made by the air when it passes through the glottis without
    the vocal cords being stretched;

(3) words containing the letter “r” show a high degree of discrimination: the inability
    to pronounce the sound “r”, or difficulty in doing so (rhotacism), influences the
    recognition of the syllables of a word and consequently the analysis. In cases where
    the words have a maximum of three syllables, the recording is often recognized
    as different and not similar.


Out of a total of 30 subjects, the algorithm successfully discriminated 87 cases, which 
means a matching rate of 72%. 


5 Discussion and Conclusions 


The overall goal of this study is to build an automatic system able to detect pronunciation
defects based on voice audio recordings. The method requires two audio
recordings of the same word: the first one with the correct pronunciation and the second
one with a generic pronunciation. Using the information theory method, it is possible
to highlight, through an isometric diagram, the points at which there is a pronunciation
defect.

It is important that this study be viewed more as a demonstration of an approach than
as the presentation of an absolute method. The trend-line technique is very promising as
a tool for assessing phonetic similarity. One of the main advantages of this technique is
that it allows the analysis of words in different languages (Italian, English, French,
German, ...) without requiring changes to the algorithm. At the same time, the results expressed
through the isometric diagram allow easy reading and identification of the severity of
the pronunciation defect (on the basis of the colour). This method could be a useful tool
for speech therapy analysis and for analysing the pronunciation of young children.

However, there are some limitations to consider: the need for words to be recorded
in the absence of noise and the need to approximate the interpretation of the results due
to the poor precision of automatic spelling systems.

Concerning future work, in addition to trying to improve the spelling of audio messages,
research should consider the distance between two points in a trend line.
From this point of view, the results presented in this paper are only preliminary.


Abstract. This paper studies the problem of deep joint-clustering using
auto-encoders. For this task, most algorithms solve a multi-objective
optimization problem, which is transformed into a single-objective
problem by linear scalarization techniques. However, this introduces a
scaling problem in the latent space for a class of algorithms. We propose an
extension that solves this problem by using scale-invariant distance
functions. The advantage of this extension is demonstrated for a particular
case of joint-clustering with MSSC (minimizing sum-of-squares clustering).
Numerical experiments on several benchmark datasets illustrate the
superiority of our extension over state-of-the-art algorithms with respect
to clustering accuracy.


Keywords: Clustering - Deep learning - Auto-encoder - Spherical 
distance 


1 Introduction 


Clustering is an important task in data mining that aims to segment
data points into groups of similar points. Despite decades of research,
clustering high-dimensional datasets is still a difficult problem due to the “curse
of dimensionality” phenomenon. In general terms, it is a “widely observed
phenomenon that data analysis techniques (including clustering), which work well
at lower dimensions, often perform poorly as the dimensionality of analyzed data
increases” [23]. Using dimension reduction techniques is a popular way to
overcome this problem, where the original features are represented by a smaller but
informative set of features. Among them, auto-encoders have recently been considered;
such methods are often referred to as “deep clustering” in the literature. They have
demonstrated a significant improvement in clustering high-dimensional data and are
currently the state-of-the-art methods for clustering.

Several works focus on developing deep joint-clustering algorithms
with different clustering techniques such as K-means [10, 22, 25, 28, 30], sub-space
clustering [13, 31, 32], robust continuous clustering [20], soft-assignment
clustering [10, 11, 27], hierarchical clustering [29], etc. They can be classified into
two main types: (1) simultaneously minimizing the auto-encoder reconstruction
and clustering objectives in a joint framework by linear scalarization techniques (i.e.
$\min F^{joint} = F^{AE} + \lambda F^{clustering}$) [10, 13, 20, 22, 25, 28, 30-32], or (2) progressively
updating the neural network mapping and clustering assignment in order to
match a target distribution/pseudo label, which is updated during the optimization
process [4, 8, 10, 11, 27, 29]. Generally speaking, it can be viewed as solving
an optimization problem or a sequence of problems where each has a form
similar to (1).

© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 231-242, 2020.
https://doi.org/10.1007/978-3-030-38364-0_21

However, the derived problem in several works is not well-defined. Since
clustering is applied to the data's representation in the latent space, the clustering
objective is affected by scaling whereas the reconstruction objective is
not. This affects deep clustering algorithms that rely on the Euclidean distance
or an $\ell_p$ norm, such as K-means (MSSC) [10, 15, 22, 25, 28, 30] and latent space
clustering [13, 31, 32], as well as methods that integrate additional tasks on the latent
space that depend on the Euclidean distance/$\ell_p$ norm [5, 6, 12, 21]. To solve this problem, one
could consider regularization techniques for neural networks, such as regularizing
$\|\theta_E\|_p$ or $\|f^E(\theta_E, x_i)\|_p$ [12]. However, this introduces another trade-off
parameter that needs to be tuned/searched, which is undesirable in the unsupervised
setting. Hence, instead of the Euclidean distance (or another $\ell_p$ norm
in place of $\ell_2$), we could choose a distance function that is invariant to
scaling. In this direction, cosine distance has been used extensively, especially for
document clustering [7]. Similarly, [1] introduces an $\ell_2$-layer to project the
data points' representations onto a unit hypersphere. In the context of deep
clustering, this approach is roughly similar to using the cosine distance directly in
the latent space of the auto-encoder, which improves clustering accuracy
significantly over other regularization methods (such as Batch Norm and Layer Norm).
Hence, motivated by the success of cosine distance in deep clustering, this paper
considers two variants of cosine distance.

Our contributions in this work are to: 


— Propose a solution to the scaling problem in deep joint-clustering, and instan- 
tiate a specific algorithm for the deep joint-clustering using MSSC. 

— Conduct numerical experiments for high-dimensional large-scale datasets with 
several recent clustering methods to study the quality of the proposed algo- 
rithms. 


The rest of this paper is organized as follows. Section 2 introduces the
limitation of existing deep joint-clustering problems by auto-encoder. Section 3
outlines our proposed solution, with a specific application to deep
joint-clustering by MSSC (Sect. 3.2). The numerical experiments on real-world
datasets are reported in Sect. 4.
2 Scaling Problem in a Class of Deep Clustering Algorithms


2.1 Auto-Encoder 


The standard stacked auto-encoder consists of two components: an encoder and
a decoder. An encoder $f^E$ (resp. decoder $f^D$) is a neural network
parametrized by $\theta_E$ (resp. $\theta_D$) that maps data from
$\mathbb{R}^m \to \mathbb{R}^d$ (resp. $\mathbb{R}^d \to \mathbb{R}^m$). Given
a raw input data point $x_i \in \mathbb{R}^m$ (e.g. an image, a document, etc.),
the encoder first produces the code $z_i \in \mathbb{R}^d$, then the decoder
produces a reconstruction $\hat{x}_i$ of $x_i$ from the code $z_i$ alone
(Fig. 1 illustrates the corresponding auto-encoder). By reconstructing the
input, the network can learn an under-complete but informative representation
of the raw input data. Mathematically, given a data-set
$\{x_i\}_{i=1,\dots,n}$ where $x_i \in \mathbb{R}^m$, the problem of
auto-encoder reconstruction can be defined as

$$\min_{\theta_E, \theta_D} \Big\{ F^{AE}(\theta_E, \theta_D) = \sum_{i=1}^{n} \ell\big(x_i, f^D(\theta_D, f^E(\theta_E, x_i))\big) \Big\} \quad (1)$$

where $\ell$ measures the reconstruction error. Common choices for $\ell$ are
mean squared error, binary cross-entropy, $\ell_1$ norm, etc. In this work, we
consider the squared error $\ell(x, y) = \|x - y\|_2^2$.


[Figure 1: Input $x$ -> Encoder $f^E$ -> Code $z = f^E(x)$ -> Decoder $f^D$ -> Output $\hat{x} = f^D(z)$]

Fig. 1. Illustration of an auto-encoder.


The auto-encoder $f^{AE} = f^D \circ f^E$ can be seen as a neural network of
$L$ layers, where $f^E = f^{(l)} \circ \cdots \circ f^{(1)}$ and
$f^D = f^{(L)} \circ \cdots \circ f^{(l+1)}$, with $l$ the index of the
bottleneck layer. Each function $f^{(i)}$ represents a layer of the neural
network, which maps the output of the previous layer $z^{(i-1)}$ into a new
code $z^{(i)} = h(z^{(i-1)}, \theta^{(i)})$, where $h$ is the activation
function. A typical choice for $h$ is a linear function
($h_{\text{linear}}(z, \theta) = \phi(z, \theta)$) or a non-linear function
such as ReLU [18] ($h_{\text{ReLU}}(z, \theta) = \max(\phi(z, \theta), 0)$,
where max is an element-wise operator). Here $\phi(z, \theta)$ is a matrix
multiplication (dense layer in a neural network) or a convolution (convolution
layer in a convolutional neural network).


2.2 Scaling Problem of Joint-Clustering by Auto-Encoder 


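Let us consider the problem of deep joint-clustering by MSSC. Several works
[10, 15, 22, 25, 28, 30] combine the auto-encoder with MSSC into one
optimization problem by using the linear scalarization technique of
multi-objective programming. Formally, this problem is defined as

$$\min_{\theta_E, \theta_D, u, s} F^{\text{JMSSC}}(\theta_E, \theta_D, u, s) := F^{AE}(\theta_E, \theta_D) + \lambda F^{\text{MSSC}}(\theta_E, u, s) \quad (2)$$

$$= \sum_{i=1}^{n} \|x_i - f^D(\theta_D, f^E(\theta_E, x_i))\|_2^2 + \lambda \sum_{i=1}^{n} \sum_{j=1}^{k} s_{ij} \|u_j - f^E(\theta_E, x_i)\|_2^2$$

$$\text{s.t. } s_{ij} \in \{0, 1\} \text{ for } i = 1, \dots, n \text{ and } j = 1, \dots, k, \qquad \sum_{j=1}^{k} s_{ij} = 1 \text{ for } i = 1, \dots, n,$$

or

$$\min_{\theta_E, \theta_D, u} F^{\text{JMSSC}}(\theta_E, \theta_D, u) = F^{AE}(\theta_E, \theta_D) + \lambda F^{\text{MSSC}}(\theta_E, u) \quad (3)$$

$$= \sum_{i=1}^{n} \|x_i - f^D(\theta_D, f^E(\theta_E, x_i))\|_2^2 + \lambda \sum_{i=1}^{n} \min_{j=1,\dots,k} \|u_j - f^E(\theta_E, x_i)\|_2^2,$$

where $\lambda$ controls the trade-off between the two terms and
$u \in \mathbb{R}^{k \times m}$ are the $k$ centroids in latent space.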

However, problems (2) and (3) both suffer from the scaling problem in the
latent space, which is explained as follows. Let $l$ be the bottleneck layer of
the $L$-layer auto-encoder, i.e.
$f^E = f^{(l)} \circ f^{(l-1)} \circ \cdots \circ f^{(1)}$ and
$f^D = f^{(L)} \circ \cdots \circ f^{(l+1)}$, where $f^{(i)}$ is the
transformation function of the $i$-th layer/block. In modern auto-encoders in
general, and in deep clustering in particular, $f^{(l)}$ and $f^{(l+1)}$ are
linear functions or, sometimes, nonlinear activation functions (such as ReLU):

ReLU: $f^{(i)}(\theta^{(i)}, z^{(i-1)}) = \mathrm{ReLU}(\theta^{(i)} z^{(i-1)})$,
Linear activation: $f^{(i)}(\theta^{(i)}, z^{(i-1)}) = \theta^{(i)} z^{(i-1)}$,

for $i \in \{l, l+1\}$, where $z^{(i)}$ is the output of the $i$-th layer. For
these networks, the first term $F^{AE}$ is invariant to the scaling of
$z^{(l)} = z_i = f^E(\theta_E, x_i)$: multiplying $\theta^{(l)}$ by a constant
$c > 0$ and dividing $\theta^{(l+1)}$ by $c$ rescales the code without changing
the reconstruction. The second term $F^{\text{MSSC}}$, in contrast, scales
quadratically with $z^{(l)}$. This problem affects deep clustering algorithms
that rely on the Euclidean distance or $\ell_2$ norm, such as K-means (MSSC)
[10, 15, 22, 25, 28, 30] and latent space clustering [13, 31, 32], as well as
methods that integrate additional tasks on the latent space that depend on the
Euclidean distance/$\ell_2$ norm [5, 6, 12, 21].
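The scaling argument can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: a two-layer linear auto-encoder without biases, hypothetical weights `W_enc` and `W_dec`, a single data point and a single centroid. None of these values come from the paper.

```python
def matvec(W, v):
    """Dense layer without bias: returns W @ v for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def scaled(W, c):
    """Return the matrix c * W."""
    return [[c * w for w in row] for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def objectives(W_enc, W_dec, x, centroid):
    """Reconstruction term and MSSC term for one point and one centroid."""
    z = matvec(W_enc, x)        # code at the (linear) bottleneck layer l
    x_hat = matvec(W_dec, z)    # reconstruction produced by layer l + 1
    return sq_dist(x, x_hat), sq_dist(centroid, z)


# Hypothetical toy weights: R^3 -> R^2 -> R^3.
W_enc = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
W_dec = [[0.5, 0.0], [0.0, 0.5], [0.5, -0.5]]
x = [1.0, 2.0, 3.0]
u = [3.0, 0.0]

c = 0.1
rec, clu = objectives(W_enc, W_dec, x, u)
# Rescale the code by c: multiply layer l by c, divide layer l + 1 by c,
# and move the centroid along with the code.
rec_c, clu_c = objectives(scaled(W_enc, c), scaled(W_dec, 1.0 / c), x,
                          [c * v for v in u])

print(rec, rec_c)   # reconstruction error is numerically unchanged
print(clu, clu_c)   # clustering term shrinks by a factor c**2
```

With any trade-off parameter, the joint objective can therefore be lowered simply by shrinking the code, without improving either the reconstruction or the cluster structure.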


3 Proposed Solution 


3.1 Spherical Distance 


Instead of computing the Euclidean distance in $\mathbb{R}^m$, we measure the
distance between the projections of the data points onto the surface of the
unit hypersphere $S^{m-1}$. By only considering the distance between
projections, we discard the magnitude of $z_i$ and keep only its direction.
Among such distances, the cosine distance function is a popular measure, which
has been used in several clustering algorithms, notably Spherical K-means [7].
It is defined as

$$d_{\text{cosine}}(z_i, z_j) = 1 - \frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2}.$$


This distance function has recently been used in the deep clustering problem
[2], and has demonstrated its effectiveness in comparison with other
regularization methods such as Batch Normalization and Layer Normalization.
However, the cosine distance is not a metric because it violates the triangle
inequality. In addition, consider comparing the distance between two
projections on the surface of the hypersphere $S^{m-1}$, as in the example in
Fig. 2. Since $\hat{x}_i$ and $\hat{x}_j$ are the projections of $x_i$ and
$x_j$ onto the hypersphere, measuring the arc length between $\hat{x}_i$ and
$\hat{x}_j$ (length of the dashed orange arc) is more suitable than the
Euclidean distance between them (length of the solid blue line segment). Hence,
instead of using $d_{\text{cosine}}$, we employ $d_{\text{spherical}}$ as

$$d_{\text{spherical}}(z_i, z_j) = \frac{\arccos(S_{\text{cosine}}(z_i, z_j))}{\pi} = \frac{1}{\pi} \arccos\left(\frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2}\right),$$

where $\arccos(\alpha)$ is the inverse cosine function for
$\alpha \in [-1, 1]$. To avoid the numerical problem of $\|z_i\|_2 = 0$, we
slightly modify $d_{\text{spherical}}$ to

$$d^{\epsilon}_{\text{spherical}}(z_i, z_j) = \frac{1}{\pi} \arccos\left(\frac{z_i^\top z_j}{\|z_i\|_2 \|z_j\|_2 + \epsilon}\right), \quad (4)$$

for a very small value of $\epsilon$.
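The two distances can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the placement of `eps` in the denominator, the normalization by pi, and the default `eps=1e-8` are assumptions made here, and the clamp to [-1, 1] is an extra numerical guard.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))


def cosine_distance(zi, zj):
    """1 - cosine similarity; undefined if either vector is zero."""
    return 1.0 - dot(zi, zj) / (norm(zi) * norm(zj))


def spherical_distance(zi, zj, eps=1e-8):
    """Normalized arc length between the projections onto the unit sphere.

    eps (assumed placement) keeps the ratio finite when a vector is zero.
    """
    c = dot(zi, zj) / (norm(zi) * norm(zj) + eps)
    c = max(-1.0, min(1.0, c))  # guard against round-off outside [-1, 1]
    return math.acos(c) / math.pi


def sq_euclidean(zi, zj):
    """Squared Euclidean distance, used by MSSC."""
    return sum((x - y) ** 2 for x, y in zip(zi, zj))


a, b = [1.0, 0.0], [1.0, 1.0]          # 45 degrees apart
a10, b10 = [10.0, 0.0], [10.0, 10.0]   # same directions, 10x the magnitude

print(spherical_distance(a, b), spherical_distance(a10, b10))
print(cosine_distance(a, b), cosine_distance(a10, b10))
print(sq_euclidean(a, b), sq_euclidean(a10, b10))
```

Both angular distances are essentially unchanged by the rescaling, while the squared Euclidean distance grows by a factor of 100.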


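[Figure 2: two points $x_i$ and $x_j$ and their projections $\hat{x}_i$ and $\hat{x}_j$ onto the unit hypersphere, with the spherical distance $d_{\text{spherical}}(x_i, x_j)$ drawn as an arc]

Fig. 2. Illustration of the spherical distance. The solid blue line segment
(resp. dashed orange arc) represents the Euclidean distance (resp. spherical
distance) between $\hat{x}_i$ and $\hat{x}_j$, which are the projections of
$x_i$ and $x_j$ onto the hypersphere (i.e. $\hat{x} = x / \|x\|_2$).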
3.2 Application for Deep Clustering with MSSC


In this section, we apply the proposed extension of Sect. 3.1 to the specific
problem (3). Changing the distance in problem (3) from the squared Euclidean
distance to $d^{\epsilon}_{\text{spherical}}$ leads to the problem

$$\min_{\theta_E, \theta_D, u} F^{\text{J-Spherical}_\epsilon}(\theta_E, \theta_D, u) = F^{AE}(\theta_E, \theta_D) + \lambda F^{\text{Spherical}_\epsilon}(\theta_E, u) \quad (5)$$

$$= \sum_{i=1}^{n} \|x_i - f^D(\theta_D, f^E(\theta_E, x_i))\|_2^2 + \lambda \sum_{i=1}^{n} \min_{l=1,\dots,k} d^{\epsilon}_{\text{spherical}}(f^E(\theta_E, x_i), u_l),$$

where $\lambda$ is the trade-off parameter.

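Motivated by the success of the Adam algorithm [14] in deep learning,
especially for solving the first term $F^{AE}(\theta_E, \theta_D)$ in (5), we
adopt Adam to solve our problem. Problem (5) is difficult because the second
term $F^{\text{Spherical}_\epsilon}$ is non-differentiable. We apply the
following smoothing technique for the min function with $\alpha_s > 0$,

$$\min_{l=1,\dots,k} r_l \approx \mathrm{LSE}_{\alpha_s}(r) = -\frac{1}{\alpha_s} \log \sum_{l=1}^{k} \exp(-\alpha_s r_l),$$

which turns problem (5) into the following optimization problem

$$\min_{\theta_E, \theta_D, u} F^{\text{J-Smooth-Spherical}_\epsilon}(\theta_E, \theta_D, u) = F^{AE}(\theta_E, \theta_D) + \lambda F^{\text{Smooth-Spherical}_\epsilon}(\theta_E, u) \quad (6)$$

$$= \sum_{i=1}^{n} \|x_i - f^D(\theta_D, f^E(\theta_E, x_i))\|_2^2 - \frac{\lambda}{\alpha_s} \sum_{i=1}^{n} \log \sum_{l=1}^{k} \exp\big(-\alpha_s \, d^{\epsilon}_{\text{spherical}}(f^E(\theta_E, x_i), u_l)\big).$$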
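The log-sum-exp smoothing of the min function can be illustrated as follows. This is a minimal sketch, assuming the form $-\frac{1}{\alpha_s}\log\sum_l \exp(-\alpha_s r_l)$; the function names and the shift by `min(r)` (a standard stabilization trick) are choices made here, not details taken from the paper.

```python
import math


def smooth_min(r, alpha_s):
    """Log-sum-exp approximation of min(r) with sharpness alpha_s > 0.

    Shifting by m = min(r) avoids overflow/underflow and does not change
    the value: -(1/a) log sum exp(-a r_l) = m - (1/a) log sum exp(-a (r_l - m)).
    """
    m = min(r)
    return m - math.log(sum(math.exp(-alpha_s * (v - m)) for v in r)) / alpha_s


def soft_assignment(r, alpha_s):
    """Gradient of smooth_min with respect to r: softmax weights on -alpha_s * r."""
    m = min(r)
    e = [math.exp(-alpha_s * (v - m)) for v in r]
    total = sum(e)
    return [v / total for v in e]


distances = [0.10, 0.40, 0.90]   # hypothetical distances to k = 3 centroids
for alpha_s in (1.0, 16.0, 256.0):
    print(alpha_s, smooth_min(distances, alpha_s), soft_assignment(distances, alpha_s))
```

The approximation always lies in the interval $[\min_l r_l - \log(k)/\alpha_s, \min_l r_l]$, so larger values of $\alpha_s$ give a tighter but less smooth surrogate, and the soft assignment weights approach a hard assignment to the closest centroid.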


Each iteration of Adam for solving problem (6) requires computing the gradient
$\nabla_{\theta_E, \theta_D, u} F$. The gradient
$\nabla_{\theta_E, \theta_D} F$ can be calculated by the back-propagation
algorithm [19], and $\nabla_u F$ is computed by

$$\frac{\partial F}{\partial u_l} = \lambda \sum_{i=1}^{n} \mathrm{softmax}(\alpha_s t_i)_l \, \frac{\partial t_{il}}{\partial u_l}, \qquad \frac{\partial t_{il}}{\partial u_l} = -\frac{1}{\pi \sqrt{1 - c_{il}^2}} \, \frac{\partial c_{il}}{\partial u_l}, \quad (7)$$

where $\mathrm{softmax}(t)_l = \exp(-t_l) / \sum_{l'=1}^{k} \exp(-t_{l'})$,
$t_i = (t_{il})_{l=1,\dots,k}$ with
$t_{il} = d^{\epsilon}_{\text{spherical}}(z_i, u_l) = \frac{1}{\pi} \arccos(c_{il})$,
$c_{il}$ is the argument of the arccos in (4), and $z_i = f^E(\theta_E, x_i)$
for $i = 1, \dots, n$.

The Adam algorithm applied to problem (5) is outlined in Algorithm 1.

Similar to MSSC-JAE-Sphere, MSSC-JAE-Cosine solves the joint-clustering
problem of auto-encoder with MSSC using the cosine distance (problem (5) with
$d_{\text{cosine}}$ instead of $d^{\epsilon}_{\text{spherical}}$) by Adam. The
difference is the step of computing
$\nabla_u F(\theta_E^t, \theta_D^t, u^t)$, which is done automatically by
autograd¹.


¹ https://pytorch.org/docs/stable/autograd.html.
Algorithm 1. MSSC-JAE-Sphere: Adam applied for problem (5)


Input: Data $x$, number of clusters $k$, trade-off parameter $\lambda$, smooth
  parameter $\alpha_s$, batch-size $b$, Adam's parameters $(\alpha, \beta_1, \beta_2)$
Initialization:
  Initialize $\theta_E, \theta_D$ (by pre-training or by random initialization).
  Initialize $u$ (by random initialization or by clustering on $z = f^E(\theta_E, x)$).
repeat
  Sample a batch $x^t$.
  Compute $G^t = \nabla_{\theta_E, \theta_D, u} F(\theta_E^t, \theta_D^t, u^t)$ with data $x^t$, where
    $\nabla_{\theta_E, \theta_D} F(\theta_E^t, \theta_D^t, u^t)$ is computed by back-propagation and
    $\nabla_u F(\theta_E^t, \theta_D^t, u^t)$ is computed by (7).
  Update $(\theta_E, \theta_D, u)$ using Adam with $G^t$.
  $t \leftarrow t + 1$.
until Stopping condition.


4 Numerical Experiment


4.1 Datasets


To study the performances of clustering algorithms, we consider the following 
image and text dataset(s), which are all widely used to benchmark deep cluster- 
ing algorithms: 


mnist: The mnist dataset [16] consists of 70000 gray-scale 28 x 28 images 
over 10 classes of handwritten digits. 

fashion: The fashion dataset [26] consists of 70000 gray-scale 28 x 28 images. 
fashion contains 10 classes of clothing items (shirt, dress, shoe, bag, etc.). 
usps: Similar to mnist dataset, the usps dataset consists of 9298 gray-scale 
16 x 16 images over 10 classes of handwritten digits. 

rcv1: Similar to [27,28], we used a subset of 10000 documents from the full
RCV1-v2 corpus (Reuters Corpus Volume 1 Version 2) of the four largest
classes. Following the procedure in [27], we represent each document by a
tf-idf vector of the 2000 most frequently occurring words.


4.2 Comparative Algorithms 


We compare the proposed methods (MSSC-JAE-Sphere and MSSC-JAE-Cosine)
with the following baselines:


MSSC: MSSC clustering on raw data.

AE+MSSC: The 2-step approach, in which an auto-encoder is first applied for
dimensionality reduction, followed by MSSC for clustering.

DC-Kmeans [25] solves a variant of problem (2) by the Alternating Direction
Method of Multipliers (ADMM).

DCN [28] solves the problem (2) by their proposed alternating stochastic gra- 
dient algorithm. 
Auto-Encoder Settings: For a fair comparison, we follow the setting
from [28]. For all algorithms, we follow the standard architecture for clustering
with auto-encoder: the numbers of nodes in the encoder are m-500-500-2000-k,
where m is the number of features in the dataset and k is the number of clusters;
and the decoder is the mirror of the encoder. The activation functions of the
embedding layer and the last layer are linear, whereas the rest are ReLU. Unless
specified otherwise, the auto-encoder is initialized following the “Xavier Uniform” scheme
[9] and pre-train by Adam with learning rate a = 10~? and (1, 82) = (0.9, 0.999) 
for 50 epochs with batch-size of 256. 


Setting for MSSC-JAE-Cosine: We use the Adam optimizer with learning rate
$\alpha = 3 \times 10^{-4}$, $(\beta_1, \beta_2) = (0.9, 0.999)$ and a
batch-size of 256. The smooth parameter is $\alpha_s = 16$. The algorithm stops
when either of the following criteria is met: (1) 500 epochs or (2) convergence
(the change of the objective value between iterations $k-1$ and $k$ falls below
$\epsilon$, or $\|(\theta, u)^k - (\theta, u)^{k-1}\|_2 \le \epsilon$). For the
initial point, the weights $\theta^0 = (\theta_E^0, \theta_D^0)$ are obtained
by the procedure above. $u^0$ is initialized as the best result (by objective
value) among 10 runs of K-means on the extracted features
$\{ f^E(\theta_E^0, x_i) / \|f^E(\theta_E^0, x_i)\| \}_{i=1,\dots,n}$. The
activation in the auto-encoder is Softplus, a smooth approximation of ReLU:
$\mathrm{Softplus}(x) = \frac{1}{\beta} \log(1 + \exp(\beta x))$, with
$\beta = 256$.


Setting for MSSC-JAE-Sphere: The setting is the same as MSSC-JAE-Cosine’s 
except u°. We set € = 10~“ for the d§nericat- For u°, we solve the MSSC-Sphere 
problem (MSSC with $d^{\epsilon}_{\text{spherical}}$ instead of
$d_{\text{squared Euclidean}}$) by the Adam optimizer (default parameters) for
10 runs on the extracted data $\{f^E(\theta_E^0, x_i)\}_{i=1,\dots,n}$. $u^0$
is selected as the result whose objective value is smallest. The algorithm is
implemented in PyTorch².


Setting for DC-Kmeans: We use the authors' implementation³. For training
the auto-encoder, we use Matlab's neural network toolbox. We also notice
that the initial point scheme for K-means from the authors' implementation
often produces bad results, so we use the same initialization procedure as
MSSC-JAE-Cosine but with the extracted features
$\{f^E(\theta_E^0, x_i)\}_{i=1,\dots,n}$. In DC-Kmeans, the authors set
$\lambda = 1$.


Setting for DCN: We use the source code available at⁴.


4.3 Experiment Setting 


Evaluation criteria: Given an input $x_i$ with ground-truth label $l_i$ and
assignment label $p_i$ from the clustering algorithm, we measure the following
criteria to evaluate experimental results:


² https://pytorch.org/.
³ https://github.com/JennyQQL/DeepClusterADMM-Release.
⁴ https://github.com/MaziarMF/deep-k-means.
— Clustering Accuracy (ACC [3]): ACC is calculated as
$\mathrm{ACC}(l, p) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{m(p_i) = l_i}$,
where $m(p_i)$ is the function which maps each cluster assignment $p_i$ to the
equivalent label. In our case, we compute this mapping with the Kuhn-Munkres
algorithm [17].

— Normalized mutual information (NMI [24]): The NMI criterion is the mutual
information $I(l, p)$ of $l$ and $p$, normalized by the entropies of $l$ and
$p$.
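The clustering accuracy can be sketched as below. To keep the example dependency-free, the optimal mapping $m$ is found by brute force over all permutations of the $k$ cluster indices, which is only feasible for small $k$; this is a stand-in for the Kuhn-Munkres algorithm, which finds the same optimum efficiently. The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels, preds, k):
    """ACC = max over one-to-one mappings m of (1/n) * #{i : m(p_i) == l_i}.

    labels, preds: sequences of integers in range(k).
    Brute force over k! mappings; a stand-in for Kuhn-Munkres on small k.
    """
    # counts[c][g] = number of points assigned to cluster c with true label g
    counts = [[0] * k for _ in range(k)]
    for l, p in zip(labels, preds):
        counts[p][l] += 1
    best = max(
        sum(counts[c][mapping[c]] for c in range(k))
        for mapping in permutations(range(k))
    )
    return best / len(labels)


labels = [0, 0, 1, 1, 2, 2]
print(clustering_accuracy(labels, [1, 1, 2, 2, 0, 0], 3))  # clusters are a pure relabeling
print(clustering_accuracy(labels, [1, 1, 2, 0, 0, 0], 3))  # one point lands in the wrong cluster
```

Because the maximum is taken over mappings, the score does not depend on how the clustering algorithm happens to number its clusters.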


All deep clustering algorithms in our experiment (except MSSC and AE+MSSC)
have a hyper-parameter $\lambda$ that controls the trade-off between the
auto-encoder and the clustering objectives. To choose $\lambda$, we perform a
grid search over $\lambda \in \{10^{-7}, 10^{-6}, \dots, 10^{7}\}$. We repeat
the experiment 10 times, select the results which have the highest accuracy
among the 10 runs, and report the average with standard deviation of each
criterion.

All experiments are conducted on an Intel(R) Xeon(R) CPU E5-2630 v4
@ 2.20 GHz with 32 GB of RAM and a GTX 1080 GPU.


4.4 Experiment Results 


The results for the evaluation of deep clustering algorithms on different
datasets are reported in Table 1.

The results show that reducing the number of dimensions by auto-encoder
facilitates the clustering process: the increase in accuracy by using the 2-step
approach (AE+MSSC) over clustering on raw data (MSSC) is significant. The gap in
clustering accuracy is up to 32.56%, as in the mnist dataset. However, the final
results of both MSSC and AE+MSSC do not exceed those of the joint-clustering
approach (DCN, MSSC-JAE-Cosine, and MSSC-JAE-Sphere). DCN further improves the
accuracy of AE+MSSC on all 4 datasets, by 0.33% to 2.68%.

The proposed methods (MSSC-JAE-Cosine and MSSC-JAE-Sphere) further
improve the clustering quality over DCN. The increase in terms of clustering
accuracy is consistent, ranging from 1.21% to 7.63% (resp. from 2.48% to 8.17%)
for MSSC-JAE-Cosine (resp. MSSC-JAE-Sphere). The NMI of both
MSSC-JAE-Cosine and MSSC-JAE-Sphere is also higher than DCN's, by up to 3.90%
and 6.72% respectively. This result demonstrates the importance of
regularization on the latent space, which is achieved in this case by
projecting the data points' representation in the latent space onto the
$\ell_2$ ball. In addition, among the two methods that utilize the $\ell_2$
projection in the latent space, MSSC-JAE-Sphere achieves higher clustering
accuracy than MSSC-JAE-Cosine on all four datasets: the gap in clustering
accuracy is from 0.22% (usps dataset) to 2.5% (mnist dataset). This increase
indicates the importance of using the appropriate distance measure in the
latent space.

In conclusion, both MSSC-JAE-Cosine and MSSC-JAE-Sphere improve over DCN
and DC-Kmeans. In addition, MSSC-JAE-Sphere is better than MSSC-JAE-Cosine,
which demonstrates the importance of using the appropriate distance measure.
Table 1. Comparative result between Auto-encoder-based joint-clustering algorithms.
Bold values correspond to best results for each dataset. NA means that the algorithm
fails to furnish a result.


Dataset                               | Algorithm        | NMI Average | NMI STD | Accuracy Average | Accuracy STD
usps (n = 9298, d = 256, k = 10)      | MSSC             | 61.35%      | 0.01%   | 67.26%           | 0.05%
                                      | AE+MSSC          | 65.41%      | 1.09%   | 69.14%           | 0.52%
                                      | DC-Kmeans        | 55.26%      | 0.11%   | 60.55%           | 0.16%
                                      | DCN              | 69.68%      | 0.60%   | 71.21%           | 0.29%
                                      | MSSC-JAE-Cosine  | 70.59%      | 1.73%   | 73.46%           | 0.62%
                                      | MSSC-JAE-Sphere  | 69.98%      | 1.11%   | 73.68%           | 0.79%
rcv1 (n = 10000, d = 2000, k = 4)     | MSSC             | 31.30%      | 5.40%   | 50.80%           | 2.90%
                                      | AE+MSSC          | 35.99%      | 5.47%   | 55.36%           | 4.70%
                                      | DC-Kmeans        | NA          | NA      | NA               | NA
                                      | DCN              | 31.54%      | 4.58%   | 58.05%           | 4.74%
                                      | MSSC-JAE-Cosine  | 34.80%      | 10.26%  | 61.69%           | 9.27%
                                      | MSSC-JAE-Sphere  | 38.27%      | 5.55%   | 64.19%           | 6.09%
mnist (n = 70000, d = 784, k = 10)    | MSSC             | 44.25%      | 0.02%   | 48.24%           | 0.05%
                                      | AE+MSSC          | 75.63%      | 0.54%   | 81.23%           | 1.83%
                                      | DC-Kmeans        | 75.65%      | 0.02%   | 78.04%           | 0.02%
                                      | DCN              | 76.96%      | 0.70%   | 83.83%           | 1.31%
                                      | MSSC-JAE-Cosine  | 80.86%      | 0.75%   | 85.04%           | 2.30%
                                      | MSSC-JAE-Sphere  | 82.81%      | 2.04%   | 86.85%           | 6.44%
fashion (n = 70000, d = 784, k = 10)  | MSSC             | 51.24%      | 0.01%   | 53.99%           | 0.07%
                                      | AE+MSSC          | 55.48%      | 0.72%   | 53.04%           | 1.99%
                                      | DC-Kmeans        | 51.64%      | 2.96%   | 47.61%           | 2.44%
                                      | DCN              | 56.51%      | 0.58%   | 53.37%           | 1.18%
                                      | MSSC-JAE-Cosine  | 60.21%      | 1.15%   | 61.01%           | 2.58%
                                      | MSSC-JAE-Sphere  | 61.34%      | 0.67%   | 61.54%           | 3.25%


5 Conclusion 


We have studied the scaling problem in the latent space for a class of deep
clustering algorithms. We proposed an extension using the cosine and spherical
distance measures, which is applicable when the derived optimization problems
suffer from the scaling of the data's representation in the latent space. Both
distance measures are invariant to scaling since they compute the distance
between projections of data points onto the surface of the unit hypersphere
$S^{m-1}$ instead of between the data points in $\mathbb{R}^m$. As an
application, we considered the specific problem of deep joint-clustering with
MSSC and proposed two algorithms (MSSC-JAE-Cosine and MSSC-JAE-Sphere) to solve
the mentioned problem. The numerical results show the effectiveness of the
proposed algorithms in comparison with state-of-the-art algorithms in
joint-clustering by K-means: the clustering accuracy is higher than the
second-best method (from 3.90% to 6.72%). Among the proposed extensions,
MSSC-JAE-Sphere improves the clustering accuracy over MSSC-JAE-Cosine by up to
2.5%. This demonstrates the importance of using the appropriate distance
measure in the latent space for a class of deep clustering algorithms.
Abstract. We study channel number reduction in combination with
weight binarization (1-bit weight precision) to trim a convolutional neural
network for a keyword spotting (classification) task. We adopt a
group-wise splitting method based on the group Lasso penalty to achieve
over 50% channel sparsity while maintaining the network performance
within 0.25% accuracy loss. We show an effective three-stage procedure
to balance accuracy and sparsity in network training.


Keywords: Convolutional neural network · Channel pruning · Weight
binarization · Classification


1 Introduction 


Reducing complexity of neural networks while maintaining their performance is 
both fundamental and practical for resource limited platforms such as mobile 
phones. In this paper, we integrate two methods, namely channel pruning and 
weight quantization, to trim down the number of parameters for a keyword 
spotting convolutional neural network (CNN, [4]). 

Channel pruning aims to lower the number of convolutional channels, which
is a group sparse optimization problem. Though the group Lasso penalty [8] is
known in statistics, and has been applied directly in gradient descent training
of CNNs [7] earlier, we found that the direct approach is not effective to
realize sparsity for the keyword CNN [4,6]. Instead, we adopt a group version of
a recent relaxed variable splitting method [2]. In the first stage (I), this
relaxed group-wise splitting method (RGSM, see [10] for its first study on deep
image networks) accomplished over 50% sparsity while keeping accuracy loss at a
moderate level. In the next stage (II), the original network accuracy is
recovered with a retraining of float precision weights while leaving out the
channels pruned in stage I. In the last stage (III), the network weights are
binarized into 1-bit precision with a warm start training based on stage II. At
the end of stage III, a channel pruned (over 50%) and weight binarized slim CNN
is created with validation accuracy within 0.25% of that of the original CNN.


© Springer Nature Switzerland AG 2020 
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 243-254, 2020. 
The rest of the paper is organized as follows. In Sect. 2, we review the
network architecture of the keyword spotting CNN [4,6]. In Sect. 3, we introduce
the proximal operator of group Lasso, RGSM, and its convergence theorem where
an equilibrium condition is stated for the limit. We also outline binarization,
the BinaryConnect (BC) training algorithm [1] and its blended version [11] to be
used in our experiment. Through a comparison of BC and RGSM, we derive a
hybrid algorithm (group sparse BC) which is of independent interest. In Sect. 4,
we describe our three-stage training results, which indicate that RGSM is the
most effective method and produces two slim CNN models for implementation.
Concluding remarks are in Sect. 5.


2 Network Architecture 


Let us briefly describe the architecture of the keyword CNN [4,6] that
classifies a one second audio clip as either silence, an unknown word, ‘yes’,
‘no’, ‘up’, ‘down’, ‘left’, ‘right’, ‘on’, ‘off’, ‘stop’, or ‘go’. After
pre-processing by windowed Fourier transform, the input becomes a
single-channel image (a spectrogram) of size $t \times f$, same as a vector
$v \in \mathbb{R}^{tf}$, where $t$ and $f$ are the input feature dimensions in
time and frequency respectively. Next is a convolution layer that operates
as follows. A weight tensor $W \in \mathbb{R}^{(m \times r) \times 1 \times n}$
is convolved with the input $v$. The weight tensor is a local time-frequency
patch of size $m \times r$, where $m < t$ and $r < f$. The weight tensor has
$n$ hidden units (feature maps), and may down-sample (stride) by a factor $s$
in time and $u$ in frequency. The output of the convolution layer is $n$
feature maps of size $(t - m + 1)/s \times (f - r + 1)/u$. Afterward, a
max-pooling operation replaces each $p \times q$ feature patch in the
time-frequency domain by the maximum value, which helps remove feature
variability due to speaking styles, distortions etc. After pooling, we have $n$
feature maps of size $(t - m + 1)/(sp) \times (f - r + 1)/(uq)$. An
illustration is in Fig. 1. The keyword CNN has two convolutional (conv) layers
and a fully connected layer. There is 1 channel in the first conv. layer and
there are 64 channels in the second. The weights in the second conv. layer form
a 4-D tensor $W^{(2)} \in \mathbb{R}^{W \times H \times C \times N}$, where
$(W, H, C, N)$ are the dimensions of spatial width, spatial height, channels
and filters, $C = 64$.
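The feature-map sizes above can be computed with a small helper. This is a sketch under assumptions: no padding, the quotient $(t - m + 1)/s$ rounded up when the stride does not divide evenly, and non-overlapping pooling that drops incomplete patches. The input and kernel sizes in the example are hypothetical and are not taken from the paper.

```python
import math


def conv_output_size(t, f, m, r, s=1, u=1):
    """Size of each feature map after an m x r convolution (no padding),
    with stride s in time and u in frequency."""
    return math.ceil((t - m + 1) / s), math.ceil((f - r + 1) / u)


def pool_output_size(h, w, p, q):
    """Size after non-overlapping p x q max-pooling (incomplete patches dropped)."""
    return h // p, w // q


# Hypothetical spectrogram of t = 98 frames by f = 40 frequency bins,
# convolved with a 20 x 8 patch at stride 1, then 2 x 2 max-pooling.
conv_h, conv_w = conv_output_size(t=98, f=40, m=20, r=8)
pool_h, pool_w = pool_output_size(conv_h, conv_w, p=2, q=2)
print((conv_h, conv_w), (pool_h, pool_w))
```

For these hypothetical sizes the convolution yields 79 x 33 feature maps and the pooling reduces them to 39 x 16.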


3 Complexity Reduction and Training Algorithms 


3.1 Group Sparsity and Channel Pruning 


Our first step is to trim the 64 channels in the second conv. layer to a smaller 
number while maintaining network performance. Let weights in each channel 
form a group, then this becomes a group sparsity problem for which group Lasso 
(GL) has been a classical differentiable penalty [8]. Let vector 


$$w = (w_1, \cdots, w_g, \cdots, w_G), \quad w_g \in \mathbb{R}^d, \quad w \in \mathbb{R}^{d \times G},$$
[Figure 1 legend: conv2d, MaxPool2d, FC, ReLU, Input, Output]


Fig. 1. A convolutional model with single-channel input. We use a simple notation to
indicate conv. and max-pooling layers. For instance, 8x20_c1_f64_s1 indicates a conv.
layer with kernel size 8 x 20, channel number 1, filter number 64 and stride 1.


where $G$ is the number of groups. Let $I_g$ be the indices of $w$ in group
$g$. The group-Lasso penalty is [8]:

$$\|w\|_{GL} := \sum_{g=1}^{G} \|w_g\|_2. \quad (3.1)$$

It is easy to implement GL as an additive penalty term for deep neural network
training [7] by minimizing a penalized objective function of the form:

$$f(w) := \ell(w) + \mu P(w), \quad \mu > 0, \quad (3.2)$$

where $\ell(w)$ is a standard loss function on data such as cross entropy [12],
and $P(w)$ is a penalty function equal to the sum of weight decay ($\ell_2$ norm
of all network weights) and GL. For ease of notation, we merge the weight decay
term with $\ell$ and take $P$ as GL below.

In a case study of training CNN with un-structured weight sparsity [9], a
direct minimization of an $\ell_1$ type penalty as an additive term in the
training objective function provides less sparsity and accuracy (Table 4 of [9])
than the Relaxed Variable Splitting Method (RVSM [2]). In the group sparsity
setting here, we shall see that the direct minimization of GL in (3.2) is also
not efficient. Instead, we adopt a group version of RVSM [2], which minimizes
the following Lagrangian function of $(u, w)$ alternately:

$$L_\beta(u, w) = \ell(w) + \lambda P(u) + \frac{\beta}{2} \|w - u\|_2^2, \quad (3.3)$$

for a parameter $\beta > 0$.


246 J. Lyu and S. Sheen


The u-minimization is in closed form for GL. To see this, consider finding
the GL proximal (projection) operator by solving:

$$y^* = \arg\min_{y} \frac{1}{2} \|y - w\|^2 + \lambda \|y\|_{GL}, \quad (3.4)$$

for parameter $\lambda > 0$ or group-wise:

$$y_g^* = \arg\min_{y_g} \lambda \|y_g\| + \frac{1}{2} \sum_{i \in I_g} |y_{g,i} - w_{g,i}|^2. \quad (3.5)$$

If $y_g^* \neq 0$, the objective function of (3.5) is differentiable and
setting the gradient to zero gives:

$$y_{g,i} - w_{g,i} + \lambda y_{g,i} / \|y_g\| = 0,$$

or:

$$(1 + \lambda / \|y_g\|) \, y_{g,i} = w_{g,i}, \quad \forall i \in I_g,$$

implying:

$$(1 + \lambda / \|y_g\|) \, \|y_g\| = \|w_g\|,$$

or:

$$\|y_g\| = \|w_g\| - \lambda, \quad \text{if } \|w_g\| > \lambda. \quad (3.6)$$

Otherwise, the critical equation does not hold and $y_g^* = 0$. The minimal
point formula is:

$$y_{g,i}^* = w_{g,i} \big(1 + \lambda / (\|w_g\| - \lambda)\big)^{-1} = w_{g,i} (\|w_g\| - \lambda) / \|w_g\|, \quad \text{if } \|w_g\| > \lambda;$$

otherwise, $y_g^* = 0$. The result can be written as a soft-thresholding
operation:

$$y_g^* = \mathrm{Prox}_{GL,\lambda}(w_g) := w_g \max(\|w_g\| - \lambda, 0) / \|w_g\|. \quad (3.7)$$
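The soft-thresholding formula (3.7) translates directly into code. The following is a minimal sketch for a single group stored as a Python list; the function name is chosen here for illustration, and the explicit zero branch also covers the case $\|w_g\| = 0$, where the ratio in (3.7) is not defined.

```python
import math


def prox_group_lasso(w_g, lam):
    """Group soft-thresholding: shrink the whole group toward zero by lam
    in Euclidean norm, or set it to zero if its norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in w_g))
    if norm <= lam:
        return [0.0] * len(w_g)
    shrink = (norm - lam) / norm
    return [shrink * v for v in w_g]


print(prox_group_lasso([3.0, 4.0], 1.0))   # norm 5 > 1: rescaled by (5 - 1) / 5
print(prox_group_lasso([0.3, 0.4], 1.0))   # norm 0.5 <= 1: whole group pruned
```

Unlike element-wise soft-thresholding, the decision to prune is taken per group, which is what makes an entire channel vanish at once.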


The w minimization is by gradient descent, implemented in practice as
stochastic gradient descent (SGD). Combining the u and w updates, we have
the Relaxed Group-wise Splitting Method (RGSM):

$$u_g^t = \mathrm{Prox}_{GL,\lambda}(w_g^t), \quad g = 1, \cdots, G,$$

$$w^{t+1} = w^t - \eta \nabla \ell(w^t) - \eta \beta (w^t - u^t), \quad (3.8)$$

where $\eta$ is the learning rate.
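The iteration (3.8) can be sketched on a toy problem. Everything problem-specific below is an assumption made for illustration: the loss is the quadratic $\ell(w) = \frac{1}{2}\|w - a\|^2$ with a hypothetical target `a`, full gradients replace SGD, and the parameter values are arbitrary.

```python
import math


def prox_group_lasso(w_g, lam):
    """Group soft-thresholding, as in (3.7)."""
    norm = math.sqrt(sum(v * v for v in w_g))
    if norm <= lam:
        return [0.0] * len(w_g)
    return [(norm - lam) / norm * v for v in w_g]


def rgsm(w_init, grad_loss, lam, beta, eta, steps):
    """Run the RGSM iteration (3.8) on weights stored as a list of groups.

    grad_loss(w) must return the gradient in the same list-of-groups layout.
    Returns (u, w), where u is the group-sparse companion of the final w.
    """
    w = [list(g) for g in w_init]
    for _ in range(steps):
        u = [prox_group_lasso(g, lam) for g in w]          # u-update
        grad = grad_loss(w)
        w = [[wi - eta * gi - eta * beta * (wi - ui)        # w-update
              for wi, gi, ui in zip(w_g, g_g, u_g)]
             for w_g, g_g, u_g in zip(w, grad, u)]
    return [prox_group_lasso(g, lam) for g in w], w


# Toy quadratic loss l(w) = 0.5 * ||w - a||^2, so grad l(w) = w - a.
a = [[3.0, 4.0], [0.1, 0.1]]   # one strong group, one weak group (hypothetical)


def grad_loss(w):
    return [[wi - ai for wi, ai in zip(w_g, a_g)] for w_g, a_g in zip(w, a)]


u, w = rgsm(a, grad_loss, lam=0.5, beta=1.0, eta=0.1, steps=200)
print(u)   # the weak group is exactly zero in u
print(w)   # w stays dense but close to u
```

In this toy run the weak group is pruned in `u` while the strong group is only shrunk, and the pair `(u, w)` settles at a point satisfying the equilibrium relation $\nabla \ell(w) = \beta (u - w)$.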


3.2 Theoretical Aspects 


The main theorem of [2] guarantees the convergence of the RVSM algorithm under
some conditions on the parameters $(\lambda, \beta, \eta)$ and initial weights
in the case of a one convolution layer network and Gaussian input data. The
latter conditions are used to prove that the loss function $\ell$ obeys the
Lipschitz gradient inequality on the iterations. Assuming that the Lipschitz
gradient condition holds for $\ell$, we adapt the main result of [2] into:

Theorem 1. Suppose that $\ell$ is bounded from below, and satisfies the
Lipschitz gradient inequality:
$\|\nabla \ell(x) - \nabla \ell(y)\| \le L \|x - y\|$, $\forall (x, y)$, for
some positive constant $L$. Then there exists a positive constant
$\eta_0 = \eta_0(L, \beta) \in (0, 1)$ so that if $\eta < \eta_0$, the
Lagrangian function $L_\beta(u^t, w^t)$ is descending and converging in $t$,
with $(u^t, w^t)$ of the RGSM algorithm satisfying
$\|(u^{t+1}, w^{t+1}) - (u^t, w^t)\| \to 0$ as $t \to +\infty$, and
subsequentially approaching a limit point $(\bar{u}, \bar{w})$. The limit point
$(\bar{u}, \bar{w})$ satisfies the equilibrium system of equations:

$$\bar{u}_g = \mathrm{Prox}_{GL,\lambda}(\bar{w}_g), \quad g = 1, \cdots, G,$$

$$\nabla \ell(\bar{w}) = \beta (\bar{u} - \bar{w}). \quad (3.9)$$


Remark 1. The system (3.9) serves as a “critical point condition”. The
$\bar{u}$ is the desired weight vector with group sparsity that network
training aims to reach.


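Remark 2. The group-$\ell_0$ penalty is:

$$\|w\|_{G\ell_0} := \sum_{g=1}^{G} \mathbb{1}_{(\|w_g\| \neq 0)}. \quad (3.10)$$

Then the GL proximal problem (3.5) is replaced by:

$$y_g^* = \arg\min_{y_g} \lambda \, \mathbb{1}_{\|y_g\| \neq 0} + \frac{1}{2} \sum_{i \in I_g} |y_{g,i} - w_{g,i}|^2. \quad (3.11)$$

If $y_g = 0$, the objective equals $\|w_g\|^2 / 2$. So if
$\lambda \ge \|w_g\|^2 / 2$, $y_g = 0$ is a minimal point. If
$\lambda < \|w_g\|^2 / 2$, $y_g = w_g$ gives the minimal value $\lambda$. Hence
the thresholding formula is:

$$y_g^* = \mathrm{Prox}_{G\ell_0,\lambda}(w_g) := w_g \, \mathbb{1}_{\|w_g\| > \sqrt{2\lambda}}. \quad (3.12)$$

Theorem 1 remains true with (3.9) modified where $\mathrm{Prox}_{GL,\lambda}$
is replaced by $\mathrm{Prox}_{G\ell_0,\lambda}$.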
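For comparison with the soft-thresholding of (3.7), the hard-thresholding rule (3.12) can be sketched as follows. The function name is illustrative only.

```python
def prox_group_l0(w_g, lam):
    """Group hard-thresholding: keep the group unchanged if
    ||w_g||^2 / 2 > lam (i.e. ||w_g|| > sqrt(2 * lam)), otherwise zero it."""
    sq_norm = sum(v * v for v in w_g)
    if sq_norm > 2.0 * lam:
        return list(w_g)
    return [0.0] * len(w_g)


print(prox_group_l0([1.0, 1.0], 0.5))   # ||w||^2 / 2 = 1.0 > 0.5: kept as is
print(prox_group_l0([0.5, 0.5], 0.5))   # ||w||^2 / 2 = 0.25 <= 0.5: pruned
```

A surviving group is returned unshrunk, which is the main practical difference from the group-Lasso operator.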


3.3. Weight Binarization 


The CNN computation can speed up a lot if the weights are in the binary vector
form: float precision scalar times a sign vector
$(\cdots, \pm 1, \pm 1, \cdots)$, see [3]. For the keyword CNN, such weight
binarization alone doubles the speed of an Android app that runs on a Samsung
Galaxy J7 cellular phone [5] with standard tensorflow functions such as
‘conv2d’ and ‘matmul’.

Weight binarized network training involves a projection operator or the
solution of finding the closest binary vector to a given real vector $w$. The
projection is written as $\mathrm{proj}_{Q,a}\, w$, for $w \in \mathbb{R}^D$,
$Q = \mathbb{R}_+ \times \{\pm 1\}^D$. When the distance is Euclidean (in the
sense of the $\ell_2$ norm $\|\cdot\|$), the problem:

$$\mathrm{proj}_{Q,a}(w) := \arg\min_{z \in Q} \|z - w\| \quad (3.13)$$

has exact solution [3]:

$$\mathrm{proj}_{Q,a}(w) = \frac{\sum_{j=1}^{D} |w_j|}{D} \, \mathrm{sgn}(w) \quad (3.14)$$

where $\mathrm{sgn}(w) = (q_1, \cdots, q_j, \cdots, q_D)$, and

$$q_j = \begin{cases} 1 & \text{if } w_j > 0 \\ -1 & \text{otherwise.} \end{cases}$$

The projection is simply the sgn function of $w$ times the arithmetic average of
the absolute values of the components of $w$.
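The projection (3.14) is a one-liner in practice. The sketch below operates on a plain Python list; the function name is chosen for illustration.

```python
def binarize(w):
    """Closest vector of the form scale * (+/-1, ..., +/-1) in Euclidean norm:
    the scale is the mean absolute value, the signs follow sgn(w)."""
    scale = sum(abs(v) for v in w) / len(w)
    return [scale if v > 0 else -scale for v in w]


print(binarize([0.5, -1.5, 1.0]))   # mean |w| = 1.0, signs (+, -, +)
```

After projection every weight in the layer shares a single magnitude, so only one float and one sign bit per weight need to be stored.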
The standard training algorithm for binarized weight networks is
BinaryConnect [1]:

$$w_f^{t+1} = w_f^t - \eta \nabla \ell(w^t), \quad w^{t+1} = \mathrm{proj}_{Q,a}(w_f^{t+1}), \quad (3.15)$$

where $\{w^t\}$ denotes the sequence of binarized weights, and $\{w_f^t\}$ is
an auxiliary sequence of floating weights (32 bit). Here we use the blended
version [11]:

$$w_f^{t+1} = (1 - \rho) \, w_f^t + \rho \, w^t - \eta \nabla \ell(w^t), \quad w^{t+1} = \mathrm{proj}_{Q,a}(w_f^{t+1}) \quad (3.16)$$

for $0 < \rho < 1$. The algorithm (3.16) becomes the classical projected
gradient descent at $\rho = 1$, which however suffers from weight stagnation
due to the discreteness of $w^t$. The blending in (3.16) leads to a better
theoretical property [11]: the sufficient descent inequality holds if the loss
function $\ell$ has Lipschitz gradient.
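One step of the blended update (3.16) can be sketched as follows. The gradient is passed in as a plain list, and the numerical values in the example are hypothetical; in real training it would be the stochastic gradient of the loss evaluated at the binarized weights.

```python
def binarize(w):
    """Projection (3.14): mean absolute value times the sign vector."""
    scale = sum(abs(v) for v in w) / len(w)
    return [scale if v > 0 else -scale for v in w]


def blended_bc_step(w_float, w_bin, grad_at_w_bin, eta, rho):
    """One blended BinaryConnect step.

    rho = 0 gives plain BinaryConnect (3.15); rho = 1 gives projected
    gradient descent on the binarized weights.
    """
    w_float_new = [
        (1.0 - rho) * wf + rho * wb - eta * g
        for wf, wb, g in zip(w_float, w_bin, grad_at_w_bin)
    ]
    return w_float_new, binarize(w_float_new)


w_float = [0.2, -0.4]
w_bin = binarize(w_float)            # mean |w| = 0.3, signs (+, -)
grad = [1.0, -1.0]                   # hypothetical gradient at w_bin
w_float, w_bin = blended_bc_step(w_float, w_bin, grad, eta=0.1, rho=0.5)
print(w_float, w_bin)
```

The float sequence accumulates small gradient steps that the binarized weights alone could not register, and the blending term pulls it back toward the current binarized point.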


Remark 3. In view of (3.8) and (3.16), we see an interesting connection: both
involve a projection step, as Prox is a projection in essence. The difference
is that $\nabla \ell$ in BC is evaluated at the projected weight $w^t$. If we
mimic such a BC-gradient, and evaluate the gradient of the Lagrangian in $w$ at
$u^t$ instead of $w^t$, then (3.8) becomes:

$$u_g^t = \mathrm{Prox}_{GL,\lambda}(w_g^t), \quad g = 1, \cdots, G,$$

$$w^{t+1} = w^t - \eta \nabla \ell(u^t). \quad (3.17)$$

We shall call (3.17) a Group Sparsity BinaryConnect (GSBC) algorithm and
compare it with RGSM in our experiment.


4 Experimental Results 


In this section, we show training results of channel pruned and weight binarized
audio CNN based on GL, RGSM, and GSBC. We assume that the objective
function under gradient descent is $\ell(\cdot) + \mu \|\cdot\|_{GL}$, with a
threshold parameter $\lambda$. For GL, $\mu > 0$, $\lambda = 0$, $\beta = 0$.
For RGSM, $\mu = 0$, $\lambda > 0$, $\beta = 1$. For GSBC, $\mu = 0$,
$\lambda > 0$, $\beta = 0$. The experiment was conducted in TensorFlow on a
single GPU machine with NVIDIA GeForce GTX 1080. The overall architecture
[6] consists of two convolutional layers and one fully-connected layer followed
by a softmax function to output class probabilities. The training loss
$\ell(\cdot)$ is the standard cross entropy function. The learning rate begins
at $\eta = 0.001$, and is reduced by a factor of 10 in the late training phase.
The training proceeds in 3 stages:


— Stage I: channel pruning with a suitable choice of $\mu$ or $\lambda$ so that
sparsity emerges at a moderate accuracy loss.

— Stage II: retrain float precision (32 bit) weights in the un-pruned channels at 
the fixed channel sparsity of Stage I, aiming to recover the lost accuracy in 
Stage I. 

— Stage III: binarize the weights in each layer with warm start from the pruned 
network of Stage II, aiming to nearly maintain the accuracy in Stage II. 


Stage I begins with a random (cold) start and performs 18000 iterations
(default, about 50 epochs). Figure 2 shows the validation accuracy of RGSM at
(λ, β) = (0.05, 1) vs. epoch number. The accuracy climbs to a peak value above
80% at epoch 20, then comes down and ends at 59.84%. The accuracy slide
agrees with the channel sparsity gain, which begins at epoch 20 and steadily
increases to nearly 56% at the last epoch, as seen in Fig. 3. The bar graph in
Fig. 4 shows the pruning pattern and the remaining channels (bars of unit
height). At (λ, β) = (0.04, 1), RGSM Stage I training yields a higher validation
accuracy of 76.6% with a slightly lower channel sparsity of 51.6%. At the same
(λ, β) values, GSBC gives an even higher validation accuracy of 80.9% but a much
lower channel sparsity of 26.6%. The GL method produces minimal channel sparsity
in the range μ ∈ (0, 1) covering the corresponding value where sparsity emerges
in RGSM. The reason appears to be that the network has certain internal
constraints that prevent the GL penalty from getting too small. Our experiments
show that even with the cross-entropy loss ℓ(·) removed from the training
objective, the GL penalty cannot be minimized below some positive level. The
Stage I results are tabulated in Table 1 with a GL case at μ = 0.6. It is clear
that RGSM is the best method to go forward with to Stage II.


Table 1. Validation accuracy (%) and channel (ch.) sparsity (%) after Stage I (ch.
pruning).


Model              | β | λ     | μ   | Accuracy | Ch. Sparsity
Original Audio-CNN | 0 | 0     | 0   | 88.5     | 0
GL Ch-pruning      | 0 | 0     | 0.6 | 66.8     | 0
RGSM Ch-pruning    | 1 | 4.e-2 | 0   | 76.6     | 51.6
RGSM Ch-pruning    | 1 | 5.e-2 | 0   | 59.8     | 56.3
GSBC Ch-pruning    | 0 | 4.e-2 | 0   | 80.9     | 26.6


Fig. 2. Validation accuracy vs. number of epochs in Stage-1 training by RGSM at
(λ, β) = (0.05, 1). The accuracy at the last epoch is 59.84%.

Fig. 3. Channel sparsity vs. number of epochs in Stage-1 training by RGSM at
(λ, β) = (0.05, 1). The sparsity at the last epoch is 56.3%.


In Stage II, we mask out the pruned channels to keep sparsity invariant
(Fig. 5), and retrain float precision weights in the complementary part of the
network. Figure 6 shows that with a dozen epochs of retraining, the accuracy of
the RGSM pruned model at λ = 0.04 (0.05) in Stage I reaches 89.2% (87.9%),
at the level of the original audio CNN (Table 2).
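
The masking step and the sparsity figure can be sketched as follows. This is an illustrative assumption about how the mask is built (a channel counts as pruned when its weight norm is zero), not the authors' TensorFlow implementation, and the helper names are hypothetical. If the pruned layer has the 64 channels shown in Fig. 4, the reported sparsities of 56.3% and 51.6% would correspond to 36 and 33 zeroed channels (36/64 = 56.25%, 33/64 ≈ 51.56%).

```python
def channel_mask(channel_norms, tol=0.0):
    """Return 1 for channels to keep and 0 for pruned channels.

    Assumption: a channel is treated as pruned when the norm of its
    weights is at most tol after Stage I.
    """
    return [1 if n > tol else 0 for n in channel_norms]


def channel_sparsity(mask):
    """Percentage of channels that are masked out."""
    return 100.0 * mask.count(0) / len(mask)


def apply_mask(channel_outputs, mask):
    """Element-wise masking: zero every activation of a pruned channel.

    channel_outputs holds one list of activations per channel.
    """
    return [[m * v for v in ch] for ch, m in zip(channel_outputs, mask)]
```

Keeping the mask fixed during retraining is what holds the channel sparsity constant while the remaining float weights are updated.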

Fig. 4. Remaining channels illustrated by bars vs. channel number (1 to 64) after
Stage-1 training by RGSM at (λ, β) = (0.05, 1). The sparsity (% of 0's) is 56.3%.


Table 2. Validation accuracy (%) and channel (ch.) sparsity (%) after Stage II (float
precision weight retraining).


Model                                  | β | λ     | μ | Accuracy | Ch. Sparsity
Original Audio-CNN                     | 0 | 0     | 0 | 88.5     | 0
RGSM Ch-pruning + Float weight retrain | 1 | 4.e-2 | 0 | 89.2     | 51.6
RGSM Ch-pruning + Float weight retrain | 1 | 5.e-2 | 0 | 87.9     | 56.3


In Stage III, with blending parameter ρ = 1.e-5, the weights in the network
modulo the masked channels are binarized with validation accuracy 88.3%
at channel sparsity 51.6%, and 87% at channel sparsity 56.3%; see Fig. 7 and
Table 3.

Fig. 5. The CNN model with pruned channels masked out (the masking layer in red).

Fig. 6. Validation accuracy vs. number of epochs in Stage-2 float (32 bit) weight
retraining. The accuracy at the last epoch is 87.94%. Channel sparsity is 56.3%.

Fig. 7. Validation accuracy vs. number of epochs in Stage-3 binary (1-bit) weight
training. The accuracy at the last epoch is 87%. Channel sparsity is 56.3%.


Table 3. Validation accuracy (%) and channel (ch.) sparsity (%) after Stage III (weight
binarization training).


Model                                 | β | λ     | μ | Accuracy | Ch. Sparsity
Original Audio-CNN                    | 0 | 0     | 0 | 88.5     | 0
RGSM Ch-pruning + Weight binarization | 1 | 4.e-2 | 0 | 88.3     | 51.6
RGSM Ch-pruning + Weight binarization | 1 | 5.e-2 | 0 | 87.0     | 56.3


5 Conclusion and Future Work 


We successfully integrated a group-wise splitting method (RGSM) for channel
pruning, float weight retraining and weight binarization to arrive at a slim yet
almost equally performing CNN for keyword spotting. Since channel pruning
involves architecture change, there is additional work to speed up a hardware
implementation. A preliminary test on a MacBook Air with a CPU version of
TensorFlow shows as much as a 28.87% speedup for the network structure with
float precision weights in Fig. 5. An efficient way to implement the masking
layer without resorting to an element-wise tensor multiplication (especially on
a mobile phone) is worthwhile for our future work.

We also plan to study other penalties [2] such as group-ℓ0 (transformed-ℓ1)
in the RGSM framework as outlined in Remark 2, and extend the three-stage
process developed here to multi-level complexity reduction on larger CNNs and
other applications in the future.


Abstract. Biomedical image classification often deals with limited
training samples due to the cost of labeling data. In this paper, we propose
to combine deep learning, transfer learning and generative adversarial
networks to improve the classification performance. Fine-tuned VGG16 and
VGG19 networks are used to extract well-discriminated cancer features from
histopathological images before feeding them into a neural network for
classification. Experimental results show that the proposed approaches
outperform previous state-of-the-art works on the breast cancer image
dataset (BreaKHis).


Keywords: Deep learning · Transfer learning · BreaKHis dataset ·
Breast cancer · Histopathological image classification · GAN


1 Introduction 


Breast cancer is the most common invasive cancer in women and has a significant
impact on 2.1 million people yearly. In 2018, the World Health Organization
(WHO) estimated 627,000 deaths from breast cancer, about 15% of all cancer
deaths among women. Early cancer detection can help treatment and increase the
survival rate of patients. According to WHO, effective diagnostic methods exist,
such as X-ray and Clinical Breast Exam, but they require professional physicians
or experts. In practice, the diagnostic result is not always 100% accurate, for
reasons such as subjective experience, level of expertise and emotional state.
Several applications of computer vision for Computer-Aided Diagnosis (CADx)
have been proposed and implemented [6,7]. Breast cancer can be diagnosed via
histopathological microscopy imaging, for which image analysis can effectively
aid physicians and technical experts [7,12].

However, building a CADx system for breast cancer diagnosis remains challenging
due to the complexity of histopathological images. In the last decade, many
works have been proposed to enhance the recognition performance on breast
cancer images. They can be categorized into three groups:


— Handcrafted-feature or deep feature: Spanhol and Badejo [3,28] compare
several handcrafted features extracted from Local Binary Patterns, Local
Phase Quantization, Gray Level Co-Occurrence Matrices, Parameter-Free Threshold
Adjacency Statistics, and Oriented FAST and Rotated BRIEF, using 1-NN,
SVM and Random Forest classifiers. Alom et al. [2] combine the strengths of
Inception, ResNet and Recurrent Convolutional Neural Networks, with and
without augmentation, for 4 magnification factors. Zhang et al. [34] propose
using skip connections in ResNet to address the optimization issues that arise
when the network becomes deeper. Roy et al. [21] propose a patch-based
classifier using a CNN consisting of 6CONV-5POOL-3FC.

— Transfer learning approach: Weiss et al. [32] evaluate different features
extracted from VGG, ResNet and Xception with a limited number of training
samples and achieve a state-of-the-art result on the BACH dataset. This
method downsizes BACH images to 1024 x 768 in order to build the
classification model. Vo et al. [31] apply augmentation techniques such as
rotating, cutting and transforming images to increase the training data before
extracting deep features from the Inception-ResNet-v2 model, in order to avoid
over-fitting. They train the model with multi-scale input images (600 x 600,
450 x 450, 300 x 300) to extract local and global features. A Gradient
Boosting Trees model is then trained to detect breast cancer, and a fusion
model votes for the classifier with the higher accuracy. The accuracy reaches
93.8%-96.9% at low computational cost. Murtaza et al. [18] use AlexNet for
feature extraction in a hierarchical classification model that combines 6
classifiers to reduce the feature space and increase the performance.

— Generative Adversarial Network (GAN) method: Shin et al. [24] apply an
image-to-image conditional GAN model (pix2pix) to generate synthetic data
and discriminate the T1 brain tumor class on the ADNI dataset. They then use
this model on another dataset, BRATS, to classify T1 brain tumors. This GAN
model can increase accuracy compared to training on the real image dataset.
Iqbal et al. [8] propose a new GAN model for Medical Imaging (MI-GAN) to
generate synthetic retinal vessel images for the STARE and DRIVE datasets. This
method generates more precise segmented images than existing techniques.
The authors state that the synthetic images contain the content and structure
of the original images. Senaras et al. [22] employ a conditional GAN (cGAN) to
generate synthetic histopathological breast cancer images. Six readers (three
pathologists and three image analysts) tried to differentiate 15 real from 15
synthetic images, and the probability that an average reader would be able to
correctly classify an image as synthetic or real more than 50% of the time was
only 44.7%. Mahapatra et al. [15] propose a P-GAN network to generate a
high-resolution image at defined scaling factors from a low-resolution image.


Both handcrafted and deep features demonstrate good cancer detection
capability. Various studies combine numerous color features and local texture
descriptors to improve the performance [1,16]. Modak et al. [16] performed a
comparative analysis of several multi-biometric fusions at different levels:
feature level (mostly feature concatenation), score level, and rule/algorithm
level. The authors showed statistically that the fusion approach offers many
advantages over a single mode, such as improved accuracy, reduced noise and
spoof attacks, and greater convenience. Fotso Kamga Guy et al. [1] exploited the
powerful transfer-learning technique with popular models such as AlexNet,
VGGNet-16, VGGNet-19, GoogleNet and ResNet to design a fusion scheme at feature
level for satellite image classification. They report that fusing features from
many ConvNet layers is better than using features extracted from a single layer.
Features extracted from a CNN are less affected by different conditions such as
angle of view and color space; they are invariant features and generalize
better. Thus data augmentation methods might affect the accuracy if they are
applied inadequately. To avoid the computational cost of training from scratch,
transfer learning can be employed in the medical field. Some layers need to be
retrained or fine-tuned so that these networks can detect cancer features.
Furthermore, GAN is an effective data augmentation method in computer vision,
but the GAN training process is still a difficult problem. These methods have
been investigated intensively for common data and rarely for medical data. To
overcome this limitation, we propose a composition of three techniques to boost
the breast cancer classification accuracy with limited training data.

The rest of this paper is organized as follows. Section 2 introduces our
proposed approach, which combines three methods: transfer learning, deep
learning and GAN. The experimental results are then presented in Sect. 3.
Finally, the conclusion is given in Sect. 4.


2 Proposed Approach 


In recent years, Convolutional Neural Networks (CNNs) have proved to be an
efficient approach in computer vision and have significantly improved cancer
classification. Both VGG16 and VGG19 have proven to be good candidates for
transfer learning. To obtain features that discriminate benign from malignant
tumors, the base networks have to be retrained on the BreaKHis dataset; their
outputs are then used as input to a CNN.

A combination of different feature extraction methods can increase the
classification accuracy. This work uses the VGG16 network alone and then both
VGG16 & VGG19 to extract the features. The proposed architecture is summarized
in Fig. 1 and can be described in the following steps:


— Input layer: the input layer has three channels of 256 x 256 pixels,
normalized from RGB patch images.

— Fine-tuning VGG16 and VGG16 & VGG19 feature extraction: the
first 17 layers of VGG16 and VGG19 hold primitive low-level spatial
characteristics learned on the ImageNet dataset, which can be transferred to
the medical dataset. The later, higher convolutional layers are trained on the
BreaKHis dataset.

— Batch normalization: a layer that normalizes the activations in the
combination of the VGG16 & VGG19 output layers, to reduce overfitting from
ImageNet's original weights.

Fig. 1. (a) Fine-tuning VGG16 and CNN, (b) Fine-tuning VGG16 & VGG19 and CNN

— Fully connected layer: all neurons in this layer have full connections to
the previous layer's neurons.

— Rectified Linear Units (ReLU) layer: the ReLU activation layer


$$ f(x) = \max(0, x) \tag{1} $$


outputs the previous layer's value if it is positive; otherwise it outputs zero.
The ReLU layer is widely used in deep learning because it makes the network
easier to train and helps it achieve better performance.

— Dropout layer: a regularization technique that randomly removes some neurons
from the network with probability 0.2 during the forward or backward
propagation process.

— Output layer: this layer uses a non-linear activation, the sigmoid function


$$ h(x) = \frac{1}{1 + e^{-x}} \tag{2} $$

Furthermore, three voting methods are applied to compute the model accuracy
from the patch images for the two classes, malignant and benign. Method A
takes the majority result over the 4 patch images as the final result for the
original image. Method B is similar to A; however, if 2 patch images are
correctly predicted and 2 patch images are wrongly predicted, the final result
for the original image is counted as correct. Finally, under method C the
original image is counted as correctly predicted if at least one patch image is
correct.
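
One possible reading of the three voting rules, expressed over per-patch correctness flags, is sketched below. The function name `vote` and the boolean-list input are hypothetical choices made for this illustration; the paper does not give an implementation.

```python
def vote(patch_correct, method):
    """Decide whether an original image counts as correctly classified.

    patch_correct : list of booleans, one per patch (4 patches per image
                    in the paper), True if that patch was classified
                    correctly
    method        : "A" (strict majority), "B" (majority, ties count as
                    correct) or "C" (at least one correct patch)
    """
    n_correct = sum(1 for c in patch_correct if c)
    n_wrong = len(patch_correct) - n_correct
    if method == "A":
        return n_correct > n_wrong
    if method == "B":
        return n_correct >= n_wrong
    if method == "C":
        return n_correct >= 1
    raise ValueError("method must be 'A', 'B' or 'C'")
```

Under this reading the methods are nested: any image counted as correct by A is also correct under B, and any image correct under B is also correct under C, so the reported accuracy can only increase from A to B to C.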


3 Experiments 


3.1 Dataset Description 


We evaluate the proposed approach on one real histopathological image
database (BreaKHis) and two databases generated from BreaKHis by GANs.
The following paragraphs describe these datasets.


The BreaKHis Dataset. BreaKHis [28] is a recent benchmark database proposed by
Spanhol et al. to study the automated classification problem for breast cancer.
This dataset contains 7,909 images (see Fig. 2) of 82 patients using 4
magnifying factors (40x, 100x, 200x, 400x). It is divided into 2 main groups,
benign and malignant tumors, with 8 cancer sub-types; its total size is 4 GB.
It is publicly available from
https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis


Fig. 2. Illustration of BreaKHis database at different magnification factors of benign
cell 40x (a), 100x (b), 200x (c), 400x (d) and malignant cell 40x (e), 100x (f), 200x
(g), 400x (h).


The Fake BreaKHis Generated by StyleGAN. StyleGAN [11] transfers the style
image to the input latent space z by using a mapping network f to create an
intermediate feature space w. The adaptive instance normalization (AdaIN)
technique is applied to control the style-transferred image. We use StyleGAN to
generate fake benign and malignant images for each scale of 40x, 100x, 200x,
400x (Fig. 3). StyleGAN is trained with 256 x 256 BreaKHis images independently
for each scale and type on a PC with one NVIDIA Tesla P100 GPU for 8 h.


Fig. 3. Illustration of generated database by StyleGAN at different magnification fac- 
tors of benign cell 40x (a), 100x (b), 200x (c), 400x (d) and malignant cell 40x (e), 
100x (f), 200x (g), 400x (h). 


The Fake BreaKHis Generated by Pix2Pix. Pix2Pix is a conditional GAN
proposed by Isola et al. [9]. This framework uses a U-Net model with skip
connections as the generator network and the PatchGAN discriminator
architecture to penalize structure at patch scale. To synthesize cancer
images at each magnification rate, we trained the Pix2Pix network using images
at the target magnification rate as conditional images and images at the
remaining magnification rates as input images. For example, benign 40x images
are the conditional images and benign 100x, 200x, 400x images are used as input
images. Because of the complex cancer structure, most of the latent space from
images at other magnification rates can be transferred to the target image and
might maintain the original features (Fig. 4).


3.2 Experimental Setup 


The accuracy was estimated by a cross validation method through 5 iterations,
with 70% of each class used for training and 30% for testing. We chose this
ratio because it is the most common decomposition (applied in more than 20
papers) in the literature on the BreaKHis dataset. We train the proposed
approach on the BreaKHis dataset described in the previous section. Firstly,
each histopathological image is divided into 2 patches in the horizontal
direction (resulting in Fig. 5b and c) and 2 patches in the vertical direction
(resulting in Fig. 5d and e).

Fig. 4. Magnification factor of fake benign cell 40x (a), 100x (b), 200x (c), 400x (d)
and fake malignant cell 40x (e), 100x (f), 200x (g), 400x (h) from Pix2Pix model.


Fig. 5. Magnification factor of 40x benign image (a), top half (b), bottom half (c),
left half (d), right half (e)

The image patch size is 700 x 230 pixels in the horizontal direction and
350 x 460 pixels in the vertical direction. Instead of extracting small patches
of 32 x 32 or 64 x 64 pixels, this approach not only keeps the textural and
geometrical features but also increases the data's complexity and dimension.
Most discriminative features are twice as strong if they lie at the center of
the image. After extracting all the patch images needed, the pixels in each
channel are normalized to the range [0, 1] in order to reduce the color
intensity range. Then each patch image is resized to 256 x 256 pixels using
bilinear interpolation. Each training image comprises the 4 patches of an
original image so that our network can learn multiple deep features and
increase the performance.
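
The patch extraction described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: a single-channel image stored as a list of rows, 8-bit pixel values (so normalization divides by 255), and hypothetical function names. The bilinear resize to 256 x 256 is left out and would be done with an image library.

```python
def split_into_patches(image):
    """Split an image into top, bottom, left and right halves.

    image is a list of rows, each row a list of pixel values. For a
    700 x 460 (width x height) image this gives two 700 x 230 patches
    and two 350 x 460 patches, as described in the text.
    """
    height = len(image)
    width = len(image[0])
    top = image[: height // 2]
    bottom = image[height // 2:]
    left = [row[: width // 2] for row in image]
    right = [row[width // 2:] for row in image]
    return top, bottom, left, right


def normalize(patch, max_value=255.0):
    """Scale pixel values to [0, 1]; assumes 8-bit input by default."""
    return [[p / max_value for p in row] for row in patch]
```

Because every patch is a full half of the image, the central region appears in all four patches, which is consistent with the remark that central features are reinforced.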

Secondly, the discriminative features extracted from the fine-tuned VGG16, or
from the concatenation of the fine-tuned VGG16 & VGG19, are classified by
our novel approach. In this work, all layers before the 17th layer of VGG16 &
VGG19 are frozen and the remaining layers are re-trained. The loss function is
binary cross-entropy and the Adam optimizer is applied. All experiments are
implemented in TensorFlow-GPU version 2 on a machine with 16 CPUs, 64 GB RAM
and a Tesla P4 GPU.


3.3 Results


Table 1 shows that the concatenation of several transfer learning features can
increase the recognition accuracy for breast cancer. Training deep networks
efficiently requires a large enough dataset, so transfer learning is the
favored approach nowadays. ImageNet and BreaKHis share the low-level feature
space but differ considerably in textural and geometrical features. Our
approach therefore trains some top layers of the VGG16 & VGG19 networks and
achieves an average accuracy from 91.7% to 95.0%.

Both evaluation methods B & C obtain an average accuracy from 94.9% to
99.2%, so they can be applied to quickly screen for cancer when patients
present any potential signs, before many costly medical examinations are
carried out. To compare our results, we carefully selected state-of-the-art
works (Table 2) with the same decomposition and experimental conditions. We can
observe that the proposed approach clearly outperforms all the previous works.
Additionally, approaches based on local image descriptors do not give good
results compared with deep learning based methods. Our work goes one step
further, since we apply GANs to generate more medical images and apply a deep
learning method to classify images.


Table 1. The experimental results of two proposed approaches on BreaKHis dataset.


Model                              | Evaluation method | 40x      | 100x     | 200x     | 400x     | Average
VGG16 ft + CNN                     | Method C          | 97.5±1.6 | 98.3±0.8 | 97.3±1.7 | 96.8±1.5 | 97.5±1.4
                                   | Method B          | 95.0±1.5 | 95.6±1.8 | 95.4±1.8 | 94.0±1.6 | 95.0±1.6
                                   | Method A          | 91.6±2.4 | 92.2±2.6 | 92.7±2.2 | 89.6±2.2 | 91.6±2.2
VGG16 ft + CNN + StyleGAN          | Method C          | 97.3±1.3 | 98.0±1.3 | 97.4±1.2 | 95.3±2.1 | 97.1±1.4
                                   | Method B          | 94.7±1.9 | 95.7±1.9 | 95.0±1.9 | 93.0±2.4 | 94.6±1.9
                                   | Method A          | 90.9±2.0 | 92.0±2.0 | 92.2±1.7 | 89.2±1.5 | 91.1±1.7
VGG16 ft + CNN + Pix2Pix           | Method C          | 97.5±1.5 | 98.5±1.0 | 97.4±2.1 | 95.3±1.5 | 97.2±1.5
                                   | Method B          | 94.9±2.9 | 96.2±1.6 | 95.4±2.2 | 92.8±1.9 | 94.9±2.1
                                   | Method A          | 91.4±3.4 | 92.9±1.8 | 92.8±2.5 | 89.3±1.9 | 91.7±2.3
VGG16 & VGG19 ft + CNN             | Method C          | 99.2±1.0 | 99.5±0.6 | 99.2±1.1 | 99.1±1.3 | 99.2±1.0
                                   | Method B          | 98.2±1.6 | 98.3±1.3 | 98.2±1.3 | 97.5±2.1 | 98.1±1.5
                                   | Method A          | 95.1±3.0 | 95.2±2.4 | 95.2±1.7 | 94.6±2.9 | 95.0±2.4
VGG16 & VGG19 ft + CNN + StyleGAN  | Method C          | 98.6±0.8 | 99.0±1.3 | 99.0±1.0 | 98.1±1.8 | 98.7±1.2
                                   | Method B          | 96.7±0.8 | 97.9±1.8 | 97.8±1.9 | 96.1±2.5 | 97.1±2.0
                                   | Method A          | 93.5±3.2 | 95.2±3.0 | 94.4±2.7 | 92.6±3.5 | 94.0±3.0
VGG16 & VGG19 ft + CNN + Pix2Pix   | Method C          | 98.8±1.4 | 98.8±1.4 | 98.7±1.6 | 97.8±1.7 | 98.6±1.5
                                   | Method B          | 97.0±2.6 | 97.3±2.3 | 97.3±2.0 | 95.5±2.0 | 96.8±2.2
                                   | Method A          | 93.8±3.4 | 94.4±3.1 | 94.2±2.7 | 91.8±2.8 | 93.6±2.9


Table 2. Comparison of the proposed approach with previous works in the state-of-
the-art on BreaKHis dataset.


Ref.  | Year | Method                        | 40x  | 100x | 200x | 400x | 2 classes
[2]   | 2019 | IRRCNN + augmentation         | 97.9 | 97.5 | 97.3 | 97.4 | -
[31]  | 2019 | Inception & Boosting & Fusion | 95.1 | 96.3 | 96.9 | 93.8 | -
[34]  | 2019 | ResNet50 + CBAM               | 91.2 | 91.7 | 92.6 | 88.9 | -
[23]  | 2018 | VGG16 (finetuning) + LR       | -    | -    | -    | -    | 91.7
[19]  | 2018 | Active learning               | 89.4 | 90.9 | 91.6 | 90.4 | -
[25]  | 2018 | CSE (Fisher vector)           | 87.5 | 88.6 | 85.5 | 85.0 | -
[26]  | 2017 | Intra-embedding algorithm     | 87.7 | 87.6 | 86.5 | 83.9 | -
[30]  | 2019 | Non parametric                | 87.8 | 85.6 | 80.8 | 82.9 | -
[27]  | 2017 | DeCaf feature                 | 84.6 | 84.8 | 84.2 | 81.6 | -
[29]  | 2016 | CNN                           | 85.6 | 83.5 | 83.1 | 80.8 | -
[28]  | 2016 | PFTAS                         | 83.8 | 82.1 | 85.1 | 82.3 | -
[18]  | 2019 | BMIC Net                      | -    | -    | -    | -    | 95.5
[5]   | 2018 | DMAE                          | 89.8 | 88.0 | 91.5 | 89.2 | -
[10]  | 2018 | MVPNet+NuView data            | -    | -    | -    | -    | 92.2
[3]   | 2018 | Texture Descriptor            | 91.1 | 90.7 | 87.2 | 87   | -
[13]  | 2018 | CNN                           | 82.0 | 86.2 | 84.6 | 84.0 | -
[17]  | 2018 | PCANet                        | 96.1 | 97.4 | 90.9 | 85.9 | -
[14]  | 2018 | Multi-task deep learning      | 94.8 | 94.0 | 93.8 | 90.7 | -
[4]   | 2018 | Deep VGG16 & Reduction        | 86.3 | 84.9 | 84.7 | 81.0 | -
[33]  | 2018 | Domain Knowledge              | -    | -    | -    | -    | 81.2
[20]  | 2018 | CNN + Over-sampling           | -    | -    | -    | -    | 86.8
Our-A |      | VGG16 & VGG19 & CNN           | 95.1 | 95.2 | 95.2 | 94.6 | 95.0
Our-B |      | VGG16 & VGG19 & CNN           | 98.2 | 98.3 | 98.2 | 97.5 | 98.1


4 Conclusion 


We proposed a composition of three techniques, transfer learning, deep
learning and GAN, to boost the breast cancer classification accuracy with a
limited training dataset. We studied two GAN models, StyleGAN and
Pix2Pix, to augment the medical training dataset. At each training iteration,
we add 4,800 fake images generated by StyleGAN and 2,912 fake images generated
by Pix2Pix. The experiments show that the GAN images introduced considerable
noise and affected the classification accuracy. Although the GAN networks
cannot generate the same structure as the original images, they can synthesize
some features from medical images, and the resulting accuracy did not differ
much. Future work is to adjust the U-Net generator in the Pix2Pix network to
increase the volume of the training set and improve the classification
performance.


Abstract. Understanding students' learning experiences on social
media is an important task in educational data mining, since it provides
more complete and in-depth insights that help educational managers get
necessary information in a timely fashion and make more informed
decisions. Current systems still rely on traditional machine learning methods
with hand-crafted features. Another challenge is that important
information can appear at any position in the posts/sentences. In this paper,
we propose an attentive biLSTM method to deal with these problems.
This model uses a neural attention mechanism with biLSTMs to
automatically extract and capture the most critical semantic features in
students' posts with regard to the current learning experience. We perform
experiments on a Vietnamese benchmark dataset, and the results indicate
that our model achieves state-of-the-art performance on this task. We
achieved 63.5% in the micro-average F1 score and 59.7% in the
macro-average F1 score for this multi-label prediction.


Keywords: Attention mechanism · biLSTMs · Students' learning
experience · Social media


1 Introduction 


Students' learning experience refers to the feelings/thoughts of students in the
process of getting knowledge or skills from studying in academic environments.
It is considered to be one of the most relevant indicators of education quality
in schools/universities [17]. Understanding it is an effective and important
way to improve educational quality in schools/universities.

Learning experiences can vary dramatically among students. To determine
students' learning experiences, the widely used methods are surveys, direct
interviews or observations, which provide important opportunities for educators
to obtain student feedback and identify key areas for action. Unfortunately,
these traditional methods are usually time-consuming and thus cannot be
repeated frequently. Moreover, they also raise questions about the accuracy and
validity of the data collected, because they do not accurately reflect what
students were thinking or doing at the time the problems/issues happened.
Another drawback is that the selection of the standards of educational practice
and student behavior implied in the survey questions has also been
criticized [5].

Nowadays, social sites such as Facebook, forums, blogs, etc. provide great
venues for students to express their opinions, concerns and emotions about the
learning process. When students post on these sites, they usually write about
their feelings/thoughts at that moment. Therefore, the textual data collected
from on-line conversations may be more authentic and unfiltered than responses
to formal research surveys. These public data sets provide a vast amount of
insight for educators to understand students' experiences, in addition to the
traditional methods above. For mining these datasets, several works exist for
English using traditional machine learning classifiers with hand-crafted
features. Some typical classifiers used in mining various problems in students'
learning process are Decision Tree [13], Naive Bayes [6], SVM [8], Memetic [2],
etc. In Vietnamese, not much effort has been spent on mining such data so far.
Tran and Nguyen [14] presented the first work towards mining social media to
get insights from Vietnamese students' posts. They developed a framework using
Naive Bayes and Decision Tree to automatically detect students' issues and
problems in their study at universities.

Recently, deep neural network approaches have provided an effective way of
reducing the number of hand-crafted features. Specifically, neural networks
have been shown to improve the performance of many tasks, ranging from question
generation [18] and machine translation [7] to relation classification [19].
Hence, in this paper, we propose a novel architecture exploiting a neural
network called attention-based biLSTMs for mining students' learning
experiences. This model does not use any features derived from knowledge
resources or Natural Language Processing (NLP) systems. We perform experiments
on a benchmark dataset, and achieve 63.5% in the micro-average F1 score and
59.7% in the macro-average F1 score, higher than the existing methods in the
literature for this critical task.

The rest of this paper is organized as follows: Sect. 2 presents related work.
In Sect. 3, we present the proposed method using attention-based biLSTMs to
deal with the task. Section 4 shows experimental setups, evaluation metrics,
experimental results and some findings of this work on a benchmark dataset for
Vietnamese. Finally, we summarize the paper in Sect. 5 and discuss some
on-going work for the future.


2 Related Work 


Social media has risen to be not only a communication medium for personal
purposes, but also a medium for users to share opinions about products and
services or even political issues. Many researchers from diverse fields have
developed tools to formally represent, measure, model, and mine meaningful
patterns (knowledge) from large-scale social data for the concerned domains. In
healthcare, many studies, e.g. Sue et al. [12], have shown that social media
can be used to reveal a lot of health information about its users, or to
provide online social support for anyone with health problems [16]. In the
marketing field, researchers mine social data to recommend friends or items
(e.g. online courses, videos, beauty products, research papers, search
keywords, social tags, and other products in general) on social media sites.

Recently, research on mining informal web-based conversations on
social media (e.g., Facebook, forums, etc.) has started emerging. These sites
generate a huge amount of textual data that contain important information
about students. Many studies have proposed different techniques to process such
data to learn more about students and their learning environments. This
information is valuable to institutions/universities for making informed
decisions related to students' learning. For example, Chen et al. [3] first
provided a framework for analyzing this kind of data using Twitter posts for
educational goals. Takle et al. [13] conducted a detailed comparison of
different classification techniques, such as Iterative Dichotomiser (ID3),
Naive Bayes Multi-label Classifier and Memetic Classifier, on a common dataset
to analyze and extract information related to students in order to enhance the
higher education system. Blessy et al. [2] developed a framework that uses both
qualitative analysis and big data mining techniques, with the Naive Bayes
Multi-label Classifier algorithm and the Memetic classifier, to categorize
tweets presenting students' problems. Pande et al. [8] exploited the SVM method
to detect many issues such as stress, suicide, sleep problems, and anxiety in
students' posts. Patil et al. [9] showed how students express their feelings
via social media sites and which posts fall into which category, using the
Memetic algorithm. Jessiepriscilla et al. [6] built a sentiment analyzer tool
for analyzing tweets, which can be used to determine student learning
experiences using the Naive Bayes multi-label classifier. All of these studies
used traditional machine learning methods.

While most work has focused on English, only a few attempts have been made for
Vietnamese so far. Specifically, Tran and Nguyen [14] presented the first work
on mining social media to gain insights from engineering students’ posts. They
developed a framework to automatically detect the issues and problems students
face in their studies at university. Similar to the work on English, the
authors exploited traditional machine learning methods, namely Naive Bayes and
Decision Tree, to build the prediction models. This work also contributed the
first benchmark dataset in this field for Vietnamese. The experimental results
were only a preliminary step, and more effort is needed to improve the
performance of the methods.

As can be seen, previous work mostly exploited traditional machine learning
methods, which require hand-crafted features. Designing these features is
commonly time-consuming and requires expert knowledge. Another challenge is
that some words in a post play a more important role than others in determining
its main meaning, especially when a single post carries more than one meaning.
In recent years, deep neural network methods have offered an effective way to
reduce the number of hand-crafted features, without relying on extra knowledge
or NLP systems. Therefore, this research proposes a novel architecture that
exploits an attentive biLSTM for the task of mining students’ learning
experiences on social media. Specifically, we convert the multi-label
classification problem into binary classification problems and then use the
attentive biLSTM to build the corresponding models. The effectiveness of the
proposed method is verified on a Vietnamese benchmark dataset through extensive
experiments.


3 An Attention-Based biLSTM for Understanding 
Students’ Learning Experiences 


Formally, multi-label learning can be seen as the problem of finding a method
that maps inputs x to binary vectors y, rather than to scalar outputs as in
single-label classification. A multi-label classification problem can be solved
with transformation techniques, which turn it into several single-label
classification problems. This work uses the technique called binary relevance.
Specifically, assuming that we have p labels, this method creates p new
datasets, one for each label, and then trains a single-label classifier on each
of them. Each single-label classifier only decides whether or not the current
sample belongs to its label $l$. The multi-label prediction for a new sample is
obtained by combining the classification results of all these independent
single-label classifiers.
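
The binary relevance transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy posts, the function names `binary_relevance_split` and `combine_predictions`, and the keyword rules standing in for trained per-label classifiers are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def binary_relevance_split(
    posts: Sequence[str],
    label_sets: Sequence[Set[str]],
    labels: Sequence[str],
) -> Dict[str, List[Tuple[str, int]]]:
    """Create one binary dataset per label: target is 1 if the post carries it."""
    return {
        label: [(post, int(label in gold)) for post, gold in zip(posts, label_sets)]
        for label in labels
    }


def combine_predictions(
    post: str, classifiers: Dict[str, Callable[[str], int]]
) -> Set[str]:
    """Multi-label prediction: every label whose binary classifier outputs 1."""
    return {label for label, clf in classifiers.items() if clf(post) == 1}


# Hypothetical toy data (two posts, two of the labels).
posts = ["too many assignments this week", "I feel stressed about the exam load"]
label_sets = [{"Heavy Study Load"}, {"Heavy Study Load", "Negative Emotion"}]
labels = ["Heavy Study Load", "Negative Emotion"]

datasets = binary_relevance_split(posts, label_sets, labels)

# Keyword rules stand in for the trained per-label models (assumption).
classifiers = {
    "Heavy Study Load": lambda p: int("assignments" in p or "load" in p),
    "Negative Emotion": lambda p: int("stressed" in p),
}
predicted = combine_predictions(posts[1], classifiers)
```

Each entry of `datasets` is an ordinary single-label (binary) training set, so any binary classifier can be trained on it independently of the others.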

Each of these classifiers is built using the attentive biLSTM architecture
illustrated in Fig. 1. A biLSTM is usually very effective at encoding sequences
of words and is powerful at learning from data with long-range dependencies. On
its own, however, it treats every word in a post with equal importance. The
attention mechanism is added to allow the model to pay more attention to the
important parts of students’ posts. The model can therefore automatically
concentrate on the words that have a greater impact on the final classification
and capture the most important semantic information in each post. This model
does not use any extra knowledge or outputs from NLP systems. The overall
framework consists of four main layers, as follows.


3.1 Word Embeddings Layer 


Each student’s post consists of n words, $s = \{w_1, w_2, \ldots, w_n\}$, where
$w_i$ is the $i$-th word of the post. Each word is converted into a vector
$x_i$ using word embedding. Word embedding is currently one of the most
effective representations of a vocabulary. It is capable of encoding the
context of a word in a post, semantic as well as syntactic similarity, and
relations with other words. In this paper, we use GloVe [10], an unsupervised
learning algorithm for obtaining vector representations of words.


3.2 biLSTM Layer


Let $X = (x_1, x_2, \ldots, x_n)$ be a student’s post, consisting of the vector
representations of its n words. At each position t, the output of an RNN is an
intermediate representation based on a hidden state $h_t$:


Attentive biLSTMs for Understanding Students’ Learning Experiences 271 


[Figure 1: architecture diagram. Layers from top to bottom: output layer,
attention layer, biLSTM encoder, word embedding; the input is the students’ posts.]

Fig. 1. An attention-based biLSTM for understanding students’ learning experiences
on social media.


$\hat{y}_t = \sigma(W_y h_t + b_y)$, (1)


where $W_y$ and $b_y$ denote a parameter matrix and vector, which are determined
in the training process, and $\sigma$ denotes the element-wise softmax function.
The hidden state $h_t$ is updated using an activation function of the previous
hidden state $h_{t-1}$ and the current input $x_t$:


$h_t = f(h_{t-1}, x_t)$. (2)


LSTM cells use several gates to update the hidden state $h_t$: an input gate
$i_t$, a forget gate $f_t$, and an output gate $o_t$, together with a memory
cell $c_t$. The update formulas are:


$i_t = \sigma(W_i x_t + V_i h_{t-1} + b_i)$, (3)
$f_t = \sigma(W_f x_t + V_f h_{t-1} + b_f)$, (4)
$o_t = \sigma(W_o x_t + V_o h_{t-1} + b_o)$, (5)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + V_c h_{t-1} + b_c)$, (6)
$h_t = o_t \odot \tanh(c_t)$, (7)


where $\odot$ denotes element-wise multiplication, and the weight matrices $W$,
$V$ and the bias vectors $b$ are learned.

To improve model performance, two LSTMs are trained on each post: the first
reads the post from left to right (producing $l_t$) and the second reads a
reversed copy of the post (producing $r_t$). The forward and backward outputs,
$l_t$ and $r_t$, are by default combined by concatenation before being passed
on to the next layer.


3.3 Attention Layer 


Let $H$ denote the matrix of output vectors $[h_1, h_2, \ldots, h_n]$ produced
by the biLSTM layer, where n is the post length. One could simply average these
vectors and feed the result to the classifier, but not all of this information
is equally important. Attention is therefore used to indicate which words are
more important and which are less important: a small neural network is trained
on $H$ to score the importance of each word. Let $r$ be the representation of
the post, formed as a weighted sum of the output vectors:


$M = \tanh(H)$, (8)
$\alpha = \mathrm{softmax}(w^T M)$, (9)
$r = H \alpha^T$, (10)


where $w$ is a trained parameter vector and $w^T$ is its transpose. Each weight
in $\alpha$ indicates how important the corresponding biLSTM output is; the
weighted sum is then fed into the classifier.

The final post representation used for classification is obtained as follows:

$h^* = \tanh(r)$. (11)
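
As a rough illustration of Eqs. (8) to (11), the following pure-Python sketch computes the attention weights and the pooled post representation for a tiny example. The function name `attention_pool`, the toy output vectors in `H`, and the parameter vector `w` are assumptions made for illustration; in the model, `w` is learned and `H` comes from the biLSTM layer.

```python
import math
from typing import List, Tuple


def attention_pool(H: List[List[float]], w: List[float]) -> Tuple[List[float], List[float]]:
    """H holds n biLSTM output vectors (each of size d); w is a vector of size d."""
    # Unnormalised scores w^T tanh(h_i), i.e. w^T M with M = tanh(H) (Eqs. 8-9).
    scores = [sum(wk * math.tanh(hk) for wk, hk in zip(w, h)) for h in H]
    # Numerically stable softmax over the n positions (Eq. 9).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of the output vectors, r = H alpha^T (Eq. 10).
    r = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(w))]
    # Final representation h* = tanh(r) (Eq. 11).
    h_star = [math.tanh(value) for value in r]
    return alpha, h_star


# Toy example: three positions, hidden size two (hypothetical values).
H = [[0.1, 0.0], [2.0, 1.5], [0.2, -0.1]]
w = [1.0, 1.0]
alpha, h_star = attention_pool(H, w)
```

In this toy input the second vector receives the largest weight, so it dominates the pooled representation, which is the behaviour the attention layer is meant to provide for informative words.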


3.4 Output Layer 


This work uses a softmax classifier¹ to predict the label $y^*$ from a
pre-defined set of classes $Y$ for a student’s post $s$. The classifier takes
the hidden state $h^*$ as input:

$p(y \mid s) = \mathrm{softmax}(W h^* + b)$, (12)


$y^* = \arg\max_y \, p(y \mid s)$. (13)

¹ Instead of the softmax function, the sigmoid function can also be used. In
binary classification the sigmoid and softmax functions are equivalent, whereas
in multi-class classification the softmax function is preferred.
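
The equivalence mentioned in the footnote can be checked numerically: a two-class softmax over logits $(z_1, z_2)$ assigns the first class the probability $\mathrm{sigmoid}(z_1 - z_2)$. The sketch below uses arbitrary, hypothetical logit values.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the positive and negative class of one binary classifier.
z_pos, z_neg = 1.3, -0.4
p_softmax = softmax([z_pos, z_neg])[0]
p_sigmoid = sigmoid(z_pos - z_neg)
```

Both values agree up to floating-point rounding, so for the binary classifiers produced by binary relevance the two output layers describe the same model family.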




4 Experiments 


This section first presents the dataset used in the experiments and the
evaluation metrics used to estimate the effectiveness of the proposed method.
It then describes the detailed experimental configuration. Finally, it reports
the experimental results on this dataset.


4.1 Dataset 


Data were collected from a forum of a well-known university in Vietnam. The
dataset contains 1834 posts relating to the learning experiences of students at
an information technology university. In this dataset, one post can fall into
one or multiple categories. There are seven categories, which are also the main
problems and issues that students often encounter while studying at the
university. Figure 2 shows the number of instances per label in our dataset.


[Figure 2: horizontal bar chart; horizontal axis from 0 to 500 posts. Categories
from top to bottom: Others, Diversity issues, Material resources, English
barriers, Career targets, Negative Emotion, Heavy Study Load.]

Fig. 2. Number of posts in each category of the dataset analyzed.


4.2 Evaluation Metrics


The evaluation metrics for multi-label classification are slightly different
from those for the single-label task. In multi-label classification, a
prediction is not simply right or wrong: a predicted set of labels that
includes a subset of the gold classes should be considered better than a
predicted set that does not contain any gold class. For this situation,
researchers [4] proposed two types of metrics, example-based measures and
label-based measures. In this paper, we report both to evaluate the performance
of the method.


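Example-Based Measures. These measures are calculated on each example (in this
case, each post is considered an example) and then averaged over all posts in
the dataset. Suppose that for the i-th post the gold (true) set of labels is
$G_i$ and the set of labels predicted by the classifier is $P_i$. The
example-based evaluation metrics are calculated as follows:


$\mathrm{Accuracy} = \frac{1}{M} \sum_{i=1}^{M} \frac{|G_i \cap P_i|}{|G_i \cup P_i|}$,
$\mathrm{Prec} = \frac{1}{M} \sum_{i=1}^{M} \frac{|G_i \cap P_i|}{|P_i|}$,
$\mathrm{Rec} = \frac{1}{M} \sum_{i=1}^{M} \frac{|G_i \cap P_i|}{|G_i|}$,
$F_1 = \frac{1}{M} \sum_{i=1}^{M} \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$,


where M is the number of posts in the corpus, and $\mathrm{Precision}_i$ and
$\mathrm{Recall}_i$ are the precision and recall of the i-th post.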
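
A minimal sketch of the example-based measures is given below, with label sets represented as Python sets. The function name `example_based_metrics`, the toy label sets, and the conventions for empty sets (a post with no gold and no predicted labels counts as accuracy 1; undefined ratios count as 0) are assumptions, since the text does not specify them.

```python
from typing import Dict, Sequence, Set


def example_based_metrics(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> Dict[str, float]:
    """Average per-post accuracy, precision, recall and F1 over M posts."""
    M = len(gold)
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for G, P in zip(gold, pred):
        inter = len(G & P)
        union = len(G | P)
        acc = inter / union if union else 1.0   # assumption for empty G and P
        prec = inter / len(P) if P else 0.0     # assumption for empty P
        rec = inter / len(G) if G else 0.0      # assumption for empty G
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        totals["accuracy"] += acc
        totals["precision"] += prec
        totals["recall"] += rec
        totals["f1"] += f1
    return {name: value / M for name, value in totals.items()}


# Hypothetical gold and predicted label sets for two posts.
gold = [{"A", "B"}, {"C"}]
pred = [{"A"}, {"C", "D"}]
scores = example_based_metrics(gold, pred)
```

For these two posts the averaged accuracy is 0.5, precision and recall are both 0.75, and F1 is 2/3: the first post misses one gold label and the second adds one spurious label.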

Two further measures commonly used to estimate the effectiveness of multi-label
classification are micro-average F1 and macro-average F1. The former gives the
same weight to each per-post classification decision, while the latter gives
the same weight to each label. They are variants of F1 used in different
situations.


Label-Based Measures. These measures are computed for each label and then
averaged over all labels in the dataset. Specifically, precision, recall, and
F1 for each label $l$ are calculated as follows:


$P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F_1 = \frac{2 \cdot P \cdot R}{P + R}$,


where TP is the number of posts correctly detected as the currently considered
label $l$, FP is the number of posts that do not belong to $l$ but are assigned
to it, and FN is the number of posts of $l$ that are not recognized by the model.
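
The per-label counts can be computed directly from gold and predicted label sets, as in the sketch below. The function name `label_prf`, the toy data, and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
from typing import Sequence, Set, Tuple


def label_prf(gold: Sequence[Set[str]], pred: Sequence[Set[str]], label: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one label, from TP, FP and FN counts."""
    tp = sum(1 for G, P in zip(gold, pred) if label in G and label in P)
    fp = sum(1 for G, P in zip(gold, pred) if label not in G and label in P)
    fn = sum(1 for G, P in zip(gold, pred) if label in G and label not in P)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumption: 0 when undefined
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted label sets for four posts.
gold = [{"A"}, {"A", "B"}, {"B"}, {"A"}]
pred = [{"A"}, {"B"}, {"A", "B"}, {"A"}]
p_a, r_a, f1_a = label_prf(gold, pred, "A")
```

For label "A" there are two true positives, one false positive (third post) and one false negative (second post), so precision, recall and F1 all equal 2/3. Averaging such per-label scores over all labels gives the macro-averaged figures.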
Table 1. Experimental results of detecting students’ learning experiences using
example-based metrics


Methods        | Accuracy | Precision | Recall | F1 micro | F1 macro
Decision Tree  | 0.565    | 0.548     | 0.571  | 0.583    | 0.558
Attentive LSTM | 0.612    | 0.587     | 0.629  | 0.635    | 0.597


4.3 Experimental Setups 


The model was implemented in the Python programming language with several
typical libraries such as PyTorch, numpy, sklearn, utils, etc. These libraries
provide rich tools and options to support development in NLP and many other
research fields.

To create pre-trained word embeddings, we gathered raw text from Vietnamese
newspapers (≈ 7 GB of text) and trained the word-vector model using GloVe². The
number of word embedding dimensions was fixed at 50.

For each label, we created a corresponding dataset that focuses only on that
label, and performed 5-fold cross-validation on it to evaluate the performance
of the proposed attentive biLSTM-based model. The parameters were chosen using
a development set, for which we randomly selected 10% of the training data. To
detect students’ learning experiences, we set the number of epochs to 100, the
batch size to 20, early stopping to True with a patience of 4 epochs, and the
dropout rate to 0.5.
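
To make the training settings concrete, the sketch below collects the reported hyperparameters in a dictionary and shows one common form of early stopping with a patience of 4 epochs. The exact stopping criterion used by the authors is not stated, so monitoring the validation loss, the names `CONFIG` and `early_stopping_epochs`, and the simulated loss values are all assumptions.

```python
from typing import List, Tuple

# Hyperparameters reported in the text; the key names are hypothetical.
CONFIG = {"max_epochs": 100, "batch_size": 20, "patience": 4, "dropout": 0.5, "embedding_dim": 50}


def early_stopping_epochs(val_losses: List[float], max_epochs: int, patience: int) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return last_epoch, best_epoch


# Simulated development-set losses (not results from the paper).
simulated = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.63, 0.50]
stopped_at, best_at = early_stopping_epochs(simulated, CONFIG["max_epochs"], CONFIG["patience"])
```

With these simulated losses the best epoch is the third, and training stops after the seventh epoch, before the later improvement is ever seen. This illustrates the trade-off that a small patience value makes.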


4.4 Experimental Results 


In this paper, we compare the performance of the proposed model with the best
results of previous work on the same dataset. The best previous result was
obtained with the Decision Tree method [14] in the same binary relevance
setting. In that work, Tran and Nguyen used C4.5 (J48), a decision tree
algorithm proposed by Ross Quinlan [11]. C4.5 begins with large sets of cases
of known classes. These cases are described by any mixture of nominal and
numeric properties. The cases are carefully examined for patterns that allow
the classes to be reliably discriminated. These patterns are then expressed as
models that can later be used to classify new, unseen cases. The patterns
emphasize models that are understandable as well as accurate. C4.5 was ranked
#1 among the top 10 data mining algorithms published by Springer LNCS in 2008
[15]. Using this method, the baseline model achieved 58.3% micro-average F1 and
55.8% macro-average F1.

Table 1 shows the experimental results of the baseline and the proposed method
using example-based metrics; higher values indicate better performance. As can
be seen, the attentive biLSTM model improved performance on this task,
outperforming the baseline by roughly 4 to 6 percentage points on accuracy,
precision, recall, micro-average F1, and macro-average F1. Specifically, the
F1-micro score increased by 5.2 percentage points and the F1-macro score by 3.9
percentage points. This result suggests that the attention mechanism has a
significant effect on mining students’ learning experiences in social media: it
helps the model focus on the words that are most useful for classifying
students’ learning experiences.

² https://github.com/stanfordnlp/GloVe.


Table 2. Experimental results of the attentive-based biLSTMs for detecting students’
learning experiences using label-based metrics


          | Study load | Negative emotion | Career targets | English barriers | Others | Material resources | Diversity issues
Precision | 0.832      | 0.900            | 0.928          | 0.948            | 0.788  | 0.905              | 0.919
Recall    | 0.775      | 0.923            | 0.933          | 0.949            | 0.792  | 0.892              | 0.922
F1        | 0.788      | 0.910            | 0.921          | 0.944            | 0.776  | 0.895              | 0.914


Table 2 shows the performance of the attention-based biLSTM method on each
label using label-based metrics. The attentive biLSTM model yielded quite high
scores. Four labels, Negative Emotion, English Barriers, Career Targets, and
Diversity Issues, obtained F1 scores above 90%, and the Material Resources
label obtained 89.5%. For the remaining two labels, Heavy Study Load and
Others, the proposed method achieved around 78% F1. This result is still
promising given the ambiguity in predicting these labels: on inspecting their
samples in the dataset, we saw that they overlap considerably with the other
labels, so the model is prone to prediction errors on them.


5 Conclusion 


This paper presented a new approach to the task of determining students’
learning experiences on social media. Previous systems relied on traditional
methods with manually designed features, which take time and expert knowledge
to build. A further challenge is that not all words in a post carry the same
weight in the model’s final prediction. To address these problems, this paper
proposed an attention-based biLSTM. The model combines a neural attention
mechanism with biLSTMs to automatically extract and capture the most critical
semantic features in students’ posts. Experiments on a Vietnamese benchmark
dataset show that the model achieves state-of-the-art performance on this task
for Vietnamese. The proposed method achieved 63.5% micro-average F1 and 59.7%
macro-average F1, improving on the baseline by 5.2 and 3.9 percentage points,
respectively.

This result is quite promising and could provide more complete and in-depth
insights to help educational managers obtain the necessary information in a
timely fashion and make more informed decisions. In the future, we would like
to exploit other deep neural network architectures to build a multi-label
classifier that considers the dependencies among all labels of each post during
training. Another direction is to investigate more linguistic features, using
external resources, to enrich the prediction models.

Abstract. It is expensive to compute residual diffusivity in chaotic
incompressible flows by solving the advection-diffusion equation, due to the
formation of sharp internal layers in the advection dominated regime. Proper
orthogonal decomposition (POD) is a classical method to construct a small
number of adaptive orthogonal basis vectors for low cost computation, based on
snapshots of fully resolved solutions at a particular molecular diffusivity
$D_0^*$. The quality of the POD basis deteriorates if it is applied to
$D_0 \ll D_0^*$. To improve POD, we adapt a super-resolution generative
adversarial deep neural network (SRGAN) to train a nonlinear mapping based on
snapshot data at two values of $D_0^*$. The mapping models the sharpening
effect on internal layers as $D_0$ becomes smaller. We show through numerical
experiments that after applying such a mapping to snapshots, the prediction
accuracy of residual diffusivity improves considerably over that of the
standard POD.


Keywords: Advection dominated diffusion · Residual diffusivity ·
Adaptive basis learning · Super-resolution deep neural networks


1 Introduction 


It has been a fundamental problem to characterize the large scale effective
diffusion in fluid flows containing complex and turbulent streamlines [17]. In
this paper, we consider the passive scalar model [11]:


$T_t + (v \cdot D)\, T = D_0 \Delta T$, (1)


where $T$ is a scalar function, $D_0 > 0$ is a constant (the so-called
molecular diffusivity), $v(x,t)$ is an incompressible velocity field, and $D$
and $\Delta$ are the spatial gradient and Laplacian operators. In two
dimensions, the effective diffusivity tensor is given by [1]:

$D^E_{ij} = D_0 \left( \delta_{ij} + \langle D w_i \cdot D w_j \rangle \right)$, (2)

where $w = (w_1, w_2)$ is the unique mean zero space-time periodic vector
solution of the cell problem [1]:


$w_t + (v \cdot D w) - D_0 \Delta w = -v$, (3)


and $\langle \cdot \rangle$ denotes the space-time average over the periods.
The term $\langle D w_i \cdot D w_j \rangle$ in (2) is a positive definite
correction to $D_0 \delta_{ij}$.

The asymptotic behavior of $D^E_{ij}$ can be determined when the flow is steady
and periodic. For instance, the time independent cellular flow [4,5,13,19,20]


$v = (-H_y, H_x)$, $H = \sin x \sin y$,


has been proved to generate effective diffusion following the square root law
in the advection dominated regime [5,6]:


$D^E_{ii} = O(\sqrt{D_0}) \gg D_0$, $D_0 \downarrow 0$, $\forall i$.

However, when the streamlines are time-dependent or fully chaotic, the
enhancement can be quite different and difficult to solve analytically. A
simple example is


$v = (\cos(y), \cos(x)) + \theta \cos(t)\, (\sin(y), \sin(x))$, $\theta \in (0, 1]$, (4)


where the first term is a steady cellular flow with a $\pi/4$ rotation,
perturbed by a time-periodic flow. At $\theta = 1$, the flow (4) is fully
chaotic [21] and was investigated in the Rayleigh-Bénard experiment [3].
Numerical simulations of the fully chaotic model [2,10,18] suggest that


$D^E_{11} = O(1)$, $D_0 \downarrow 0$, (5)


hence the residual diffusivity phenomenon emerges.

Since $v$ in (4) is periodic in time, the solution of the cell problem (3)
subject to periodic boundary conditions in space can be computed accurately by
a spectral method in the Fourier basis. By expanding both $w$ and $v$ as
Fourier series, (3) becomes equivalent to an ordinary differential equation
(ODE) system. To solve the system numerically, one can approximate $w$ using
finitely many Fourier modes, whereby the problem is reduced to solving for the
periodic solution of a linear ODE system. The corresponding Poincaré map is
constructed in [10] and the solution is found as its unique fixed point. The
effective diffusivity $D^E$ can finally be computed by (2). The drawback of the
spectral approach is that the number of Fourier modes required by the truncated
problem grows rapidly as $D_0 \downarrow 0$ due to the sharp gradients of the
solution.

In [10], adaptive orthogonal basis vectors are constructed from snapshots of
spectral solutions to handle the nearly singular solutions of (3) at small
$D_0$. In particular, at a certain sample of $D_0$ and $\theta$, snapshots of
the solution over one time period form a solution matrix, and the adaptive
basis consists of the singular vectors for the top singular values of the
matrix. The linear ODE system with $D_0$ or $\theta$ close to the sampled
values can then be rewritten in terms of the adaptive basis and solved
similarly with the Poincaré map. The number of adaptive basis functions is one
or two hundred, far fewer than the number of Fourier basis functions, which is
usually at least a few thousand. With the reduced adaptive basis, the relative
error of computing residual diffusivity is no more than 6.5% with carefully
selected samples.

The above procedure of generating adaptive basis vectors by taking snapshots of
solutions and computing a singular value decomposition (SVD) is known as
reduced order modeling [14,16]. In the fluid dynamics literature [7,9], this
technique is also referred to as proper orthogonal decomposition (POD).
However, the POD method relies heavily on a collection of good snapshot data.
Snapshot data with less representation capability would hardly recover the
fully resolved solution. As shown in the test results of computing residual
diffusivity, when snapshots are collected at a certain $D_0$ and the solutions
at a much smaller $D_0' \ll D_0$ are to be computed with the reduced basis
constructed at $D_0$, the errors can increase rapidly.

In this paper, we study a deep neural network (DNN) approach to alleviate the
accuracy loss of POD and reduce the error of reduced basis computation at
$D_0$, based on prediction from snapshots at two values $D_0^1$ and $D_0^2$
(both above $D_0$). The idea is to train a mapping from snapshots (images) at
$D_0^1$ to those at $D_0^2$ ($D_0^2 < D_0^1$). The mapping sharpens the images,
similar to what happens to solutions of (1) as $D_0$ becomes smaller. The
mapping is applied to snapshots at $D_0^2$ to improve the POD basis
construction. As a proof of concept, we select a super-resolution DNN in the
form of a generative adversarial network (SRGAN, [8]). We show that it serves
our purpose well through numerical experiments in which the mapping is
constructed (trained) based on snapshots at $D_0^1$ and $D_0^2$, then tested
(applied) at $D_0 < D_0^2$.

The paper is organized as follows. In Sect.2, we describe the POD basis 
construction for (2), SRGAN architecture and its training objective. In Sect. 3, 
we present computational results on predicted residual diffusivity from POD 
basis with and without SRGAN. Concluding remarks are in Sect. 4. 


2 Construction of Adaptive Basis via DNNs 


2.1 Learning Thinner Structures 


Consider the spectral method for solving the cell problem below for $D^E_{11}$:


$w_t + (\mathbf{v} \cdot \nabla) w - D_0 \Delta w = -v$, (6)

where $\mathbf{v} = (v, \tilde{v})$, and


$v(x,t) = \cos(x_2) + \sin(x_2) \cos(t)$,
$\tilde{v}(x,t) = \cos(x_1) + \sin(x_1) \cos(t)$. (7)


Let $w_k^N$ be the k-th mode of a $(2N+1)^2$ term Fourier approximation of $w$
on the $[0, 2\pi]^2$ periodic domain [10]. Let $v_k$ and $\tilde{v}_k$ be the
k-th Fourier modes of $v$ and $\tilde{v}$ respectively. In view of (7), $v_k$
($\tilde{v}_k$) equals zero unless $k = (0, \pm 1)$ ($k = (\pm 1, 0)$).

The truncated ODE system on $w_k^N$ is:

$\frac{d w_k^N}{dt} + D_0 |k|^2 w_k^N + i \sum_{\|k - j\|_\infty \le N} \left[ (k_1 - j_1)\, v_j(t) + (k_2 - j_2)\, \tilde{v}_j(t) \right] w_{k-j}^N = -v_k(t)$, (8)


which reads in vector-matrix form: $dw/dt = A(t)\, w + v(t)$. For a time
discretization with $N_t$ grid points on $[0, 2\pi]$, denote by
$\{w^j\}_{j=1}^{N_t}$ a numerical time periodic solution to (8) for
$D_0 = D_0^*$, a value where snapshot data are collected. Such a solution can
be found by solving for the unique fixed point of the Poincaré map [10]. Define
the solution matrix of size $(2N+1)^2 \times N_t$:


$W = [w^1 \; w^2 \; \cdots \; w^{N_t}]$, (9)


and compute the SVD factorization $W = U \Sigma V^T$. Then one extracts the
columns $u_j$ ($j = 1, \ldots, m$) of the matrix $U$ corresponding to the
largest $m \ll O(N^2)$ singular values to form the adaptive orthogonal basis
vectors and the matrix $U_m = [u_1 \; u_2 \; \cdots \; u_m]$. This is the end
of basis training at a sampled value $D_0^*$. At $D_0 \ne D_0^*$, project
$w(t)$ onto the span of the column vectors of $U_m$, that is, seek a vector of
the form $U_m a(t)$, where the $m$-dimensional coefficient vector $a(t)$
satisfies the ODE system in a much lower dimension (bar is complex conjugate, T
is transpose):

$\frac{da}{dt} = \bar{U}_m^T A(t)\, U_m\, a + \bar{U}_m^T v(t)$. (10)

Finding the time periodic solution to a time discrete version of (10) via the
Poincaré map, and then computing $D^E$ by (2) in Fourier space, completes the
reduced order modeling.
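
The basis construction and the projection in (10) can be sketched with NumPy as follows. This is an illustration only: the snapshot matrix is a synthetic rank-3 matrix rather than a solution of (8), the sizes are small hypothetical values, `A` and `v` are random stand-ins for one time slice of $A(t)$ and $v(t)$, and the helper name `reduce_system` is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes, n_t, m = 49, 200, 5  # hypothetical sizes: (2N+1)^2 with N = 3, N_t snapshots, m basis vectors

# Synthetic rank-3 complex snapshot matrix standing in for W (not a real cell-problem solution).
t = np.linspace(0.0, 2.0 * np.pi, n_t, endpoint=False)
spatial = rng.standard_normal((n_modes, 3)) + 1j * rng.standard_normal((n_modes, 3))
temporal = np.stack([np.cos(t), np.sin(t), np.cos(2.0 * t)])
W = spatial @ temporal

# SVD of the snapshot matrix; the leading m left singular vectors form the adaptive basis U_m.
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U_m = U[:, :m]


def reduce_system(A, v, U_m):
    """Galerkin projection as in (10): returns (conj(U_m)^T A U_m, conj(U_m)^T v)."""
    U_h = U_m.conj().T
    return U_h @ A @ U_m, U_h @ v


# Random stand-ins for A(t) and v(t) at one time instant (assumption).
A = rng.standard_normal((n_modes, n_modes))
v = rng.standard_normal(n_modes)
A_r, v_r = reduce_system(A, v, U_m)

# Relative error of projecting the snapshots onto span(U_m).
proj_error = np.linalg.norm(W - U_m @ (U_m.conj().T @ W)) / np.linalg.norm(W)
```

Because the synthetic snapshots have rank 3 and `m = 5`, the projection error is at rounding level. For real snapshots, the decay of the singular values in `S` indicates how many basis vectors are needed.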


The inverse Fourier transform of $W$ gives the snapshot matrix in the physical
domain:


$S_p = \mathcal{F}^{-1}(W)$, (11)


where each column vector (snapshot) is an image after reshaping into a square
matrix. The matrix $S_p$ is convenient for visualization and for drawing a
connection with image processing. Figs. 1 and 2 illustrate that internal layers
in the physical domain snapshots (column vectors of $S_p$) emerge and get
thinner as $D_0$ becomes smaller. For a better prediction of the nearly
singular solutions at $D_0$ much smaller than $D_0^*$, it is helpful if the
adaptive basis learned at $D_0^*$ encodes certain thinner layered structures.
In particular, given the solution matrix $W$ at $D_0^*$ as in (9), we look for
a map $\mathcal{M}$ such that the physical domain snapshots (inverse Fourier
transform) of $\mathcal{M}(W)$ have sharpened internal layers.

Suppose $D_0^1 > D_0^2$ are two values $\ge D_0^*$, and $W^i$ is the Fourier
domain solution matrix at $D_0^i$, for $i = 1, 2$. Let $\mathcal{F}$ be the
column-wise Fourier transform on matrices; then the columns of
$\mathcal{F}^{-1}(W^i)$ are the physical domain snapshots of $W^i$. When the
$W^i$’s are available, we use a DNN to train a map $T$ for the following
regression problem:

$T : \mathcal{F}^{-1}(W^1) \to \mathcal{F}^{-1}(W^2)$. (12)
Fig. 1. Sampled snapshots of (6) at $D_0 = 10^{-2}$ with layered structures.


Since $D_0^2 < D_0^1$, $\mathcal{F}^{-1}(W^2)$ has thinner layered structures
than $\mathcal{F}^{-1}(W^1)$. By solving the regression problem (12) via a DNN,
our goal is that $T$ learns the image sharpening capability. Then $T$ can be
applied to the solution matrix $W^*$ at $D_0^* \le D_0^2$, and
$T(\mathcal{F}^{-1}(W^*))$ is expected to have thinner structures for better
prediction of residual diffusivity. Finally, the adaptive basis with thinner
structures is obtained from the SVD of


$\mathcal{M}(W^*) = \mathcal{F}(T(\mathcal{F}^{-1}(W^*)))$.


2.2  Adversarial Network 


We opt for the super-resolution generative adversarial network (SRGAN) [8] to
train the map $T$. As a generative adversarial network (GAN), SRGAN consists of
a generator network $G$ and a discriminator network $D$. The two networks
compete in such a way that $D$ is trained to distinguish real high-resolution
(HR) images from those generated from low-resolution (LR) images, while $G$ is
trained to create fake HR images from LR images to fool $D$. We train the SRGAN
with $\mathcal{F}^{-1}(W^1)$ as input data and $\mathcal{F}^{-1}(W^2)$ as
target data so that the generator
Fig. 2. Sampled snapshots of (6) at Do = 10~*, formation of thinner layers.


$G$ learns to generate thinner structures when it is fed with
$\mathcal{F}^{-1}(W)$. In this approach, we realize $T$ through a trained $G$.

The network architecture is shown in Fig. 3. The generator network $G$ starts
with a convolutional block with kernel size 9 × 9, followed by a few residual
blocks. Here a convolutional block consists of a convolutional layer and a
PReLU layer, and a residual block is a convolutional block with kernel size
3 × 3 followed by a convolutional layer of the same kernel size and a shortcut
from the input to the output. After the residual blocks, there are two more
convolutional layers with kernel sizes 3 × 3 and 9 × 9 at the end of the
network. The number of filters is the same in all convolutional blocks except
for the last one. Note that we remove the two upscale layers in [8] since the
snapshot sizes of $\mathcal{F}^{-1}(W^1)$ and $\mathcal{F}^{-1}(W^2)$ are the
same.

The discriminator network $D$ is defined following the architectural guidelines
summarized in [12]; see Fig. 3. It has eight convolutional blocks with the
PReLU layers replaced by LeakyReLU layers with slope parameter $\alpha = 0.2$.
Moreover, there is a batch normalization layer before each LeakyReLU in the
convolutional blocks. The kernel size is 3 × 3 in all convolutional blocks, and
the number of filters is doubled in the 3rd, 5th and 7th blocks. Those blocks
are followed by a fully connected layer, a LeakyReLU layer and one more fully
connected layer. Finally, the feature map is fed into a sigmoid layer, which
gives the probability that the input is a real HR image rather than a
reconstructed image.


[Figure 3: network diagrams. Legend: Conv2d, PReLU, BN, LeakyReLU, FC, sigmoid,
Residual Block, Conv Block, LR Image, HR Image.]

Fig. 3. Architecture of the generator and discriminator networks. (a) The generator
network. (b) The discriminator network. (c) Residual block in the generator network.
(d) Convolutional block in the discriminator network. We use a simple notation to
indicate the Conv2d layer. For example, k9f64s1 indicates a convolutional layer with
kernel size 9, number of filters 64 and stride 1.


As a binary classifier, the discriminator network is equipped with the cross 
entropy loss. Let us focus on the loss function of the generator network. Suppose 
F-) (W!) and F—! (W?) are real matrices of dimension (2N + 1)? x N;,, and the 
columns of F~! (W') and F~1 (W?) are x; and y;, i = 1,2,...,N;. Following 
the formulation in [8], we define the loss function of the generator network as 


1(G) =luse(G) +1077 lyea (G) + 1073 Ieen (G). (13) 
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In (13), large is the pixel-wise MSE loss defined as the sum of the squares of 
error at each pixel, 


Nt 
luse (G) => lly — G (a) ll - 
i=1 


The lygq is the VGG loss based on layers of the pre-trained VGG-19 network 
[15]. Let ¢ be a feature map of VGG-19 and s,4 be its size, then the VGG loss is 
the average of squares of Euclidean distances between the feature representations 
of y; and G (a;) 


Ni 
vac (G) = 85") Id (yi) — 6 (G (@))Ila- 


The generator network is expected to fool the discriminator network, so (13)
contains $l_{Gen}$, called the generative loss. The $l_{Gen}$ is defined based on the
cross-entropy loss of the discriminator network,

$$l_{Gen}(G) = \sum_{i=1}^{N_t} \log\left[1 - D(G(x_i))\right], \tag{14}$$

where $D(G(x_i))$ means the binary classification result of the reconstructed HR
image by the generator network $G$. In practice, we define

$$l_{Gen}(G) = \sum_{i=1}^{N_t} -\log D(G(x_i))$$

for better gradient behavior.
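As a minimal sketch of how the three terms of the generator loss fit together, the snippet below evaluates the weighted sum on flattened images stored as Python lists. The callables `G`, `D` and `phi` and the weights `w_vgg` and `w_gen` are placeholders supplied by the caller (assumptions for illustration); this is not the training code used for the experiments.

```python
import math

def squared_l2(u, v):
    """Squared Euclidean distance between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def generator_loss(xs, ys, G, D, phi, w_vgg, w_gen):
    """Weighted sum of pixel-wise MSE, feature-space (VGG-style) and generative losses.

    G, D and phi are hypothetical callables standing in for the generator,
    the discriminator and a feature map; w_vgg and w_gen are the weights.
    """
    outs = [G(x) for x in xs]
    # Pixel-wise term: sum of squared errors over all samples.
    l_mse = sum(squared_l2(y, g) for y, g in zip(ys, outs))
    # Feature-space term: squared distances between feature maps, divided by their size.
    feats = [(phi(y), phi(g)) for y, g in zip(ys, outs)]
    s_phi = len(feats[0][0])
    l_vgg = sum(squared_l2(fy, fg) for fy, fg in feats) / s_phi
    # Generative term in the "-log D(G(x))" form used in practice.
    l_gen = sum(-math.log(D(g)) for g in outs)
    return l_mse + w_vgg * l_vgg + w_gen * l_gen

# Toy check with an identity "generator", a constant "discriminator"
# and an identity feature map (all purely illustrative).
identity = lambda v: v
toy = generator_loss([[0.0, 1.0]], [[0.0, 2.0]], identity, lambda g: 0.5, identity, 1.0, 1.0)
```

Passing the weights as arguments keeps the sketch independent of any particular choice of coefficients.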


3 Experimental Results of Adaptive Basis from SRGAN 


Let $D_0^1 = 10^{-2}$, $D_0^2 = 10^{-3}$. We solved for both $W^1$ and $W^2$ via the spectral method
with $N = 50$ and $N_t = 1500$, then trained SRGAN with input data $\mathcal{F}^{-1}(W^1)$ and
target data $\mathcal{F}^{-1}(W^2)$. The training of SRGAN includes two stages: (1) we train
the generator $G$ for 50 epochs to get a pretrained model; (2) we train the entire
SRGAN for 200 epochs. Adam and SGD optimizers are applied to the training of $G$
and $D$, respectively. We set the batch size to 32 and the learning rate to $10^{-4}$ for
both optimizers. The training was carried out on a desktop with Nvidia graphics 
cards GTX 1080 Ti. We set Dj = D@ in the following experiments. 

Figure 4 shows two time slices of the input $\mathcal{F}^{-1}(W^2)$ (top) with
$G(\mathcal{F}^{-1}(W^2))$ (bottom) at $D_0 = D_0^2$. Columnwise, it can be seen that
thinner layers are created by the network $G$. Because the input and output images
must have identical dimensions, the up-scaling layers in SRGAN [8] have been
removed. This adaptation lowers the fidelity of the generated images, as we see in
each column of Fig. 4. However, the SRGAN generated snapshots are only used to
construct the reduced basis. The fact that sharp layers are generated by SRGAN
training is more important for our task of computing residual diffusivity.

Fig. 4. Input (top) and SRGAN output (bottom) at $D_0^2 = 10^{-3}$.

With $D_0^1 = 10^{-2}$ and $D_0^2 = 10^{-3}$, the comparison of predictions of $D_{11}^E$ by SVD
and SRGAN assisted SVD is shown in Table 1. The number of adaptive basis vectors is
$m = 100$ for both methods.


Table 1. Comparison of $D_{11}^E$ for flow (7) with $D_0^1 = 10^{-2}$, $D_0^2 = 10^{-3}$.

$D_0$ | $5 \times 10^{-4}$ | $4 \times 10^{-4}$ | $3 \times 10^{-4}$ | $2 \times 10^{-4}$ | $10^{-4}$
$D_{11,50}^E$ | 1.3847 | 1.3940 | 1.4105 | 1.4395 | 1.4951
Predicted $D_{11}^E$, SVD | 1.5258 | 1.5597 | 1.5969 | 1.6381 | 1.6854
Predicted $D_{11}^E$, SRGAN | 1.2429 | 1.2663 | 1.3056 | 1.3786 | 1.5293
Relative error, SVD | 10.2% | 11.9% | 13.2% | 13.8% | 12.7%
Relative error, SRGAN | 10.2% | 9.2% | 7.4% | 4.2% | 2.3%


Fig. 5. Singular vectors of $W^2$ (left col.), $\mathcal{F}(G(\mathcal{F}^{-1}(W^2)))$ (right col.).


When $D_0^1$ is closer to $D_0^2$, SRGAN assisted SVD may have even better
predictions at smaller $D_0$. In Table 2, $D_0^1 = 5 \times 10^{-3}$, $D_0^2 = 10^{-3}$ and $N = 50$, and
we predict $D_{11}^E$ at $D_0 = 3 \times 10^{-4}$, $2 \times 10^{-4}$ and $10^{-4}$. For $D_0^1 = 5 \times 10^{-3}$,
$D_0^2 = 10^{-3}$, singular vectors of $\mathcal{F}(G(\mathcal{F}^{-1}(W^2)))$ also have thinner layers
than those of $W^2$, as shown in the right column and left column of Fig. 5, respectively.
Table 3 summarizes predictions for $D_0 = 2 \times 10^{-5}$ and $10^{-5}$ from $D_0^1 = 10^{-3}$,
$D_0^2 = 10^{-4}$ and $N = 60$.


Table 2. Comparison of $D_{11}^E$ for flow (7) with $D_0^1 = 5 \times 10^{-3}$, $D_0^2 = 10^{-3}$.

$D_0$ | $3 \times 10^{-4}$ | $2 \times 10^{-4}$ | $10^{-4}$
$D_{11,50}^E$ | 1.4105 | 1.4395 | 1.4951
Predicted $D_{11}^E$, SVD | 1.5969 | 1.6381 | 1.6854
Predicted $D_{11}^E$, SRGAN | 1.3111 | 1.3862 | 1.5015
Relative error, SVD | 13.2% | 13.8% | 12.7%
Relative error, SRGAN | 7.0% | 3.7% | 0.4%

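Table 3. Comparison of $D_{11}^E$ for flow (7) with $D_0^1 = 10^{-3}$, $D_0^2 = 10^{-4}$.

$D_0$ | $2 \times 10^{-5}$ | $10^{-5}$
$D_{11,60}^E$ | 1.6052 | 1.6301
Predicted $D_{11}^E$, SVD | 1.5107 | 1.5243
Predicted $D_{11}^E$, SRGAN | 1.6234 | 1.7120
Relative error, SVD | 5.9% | 6.5%
Relative error, SRGAN | 1.1% | 5.0%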
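The relative errors reported in Table 3 can be reproduced with a few lines of arithmetic. The sketch below assumes that the relative error is measured against the spectral reference values in the second row of the table; the function name `relative_error` is illustrative.

```python
def relative_error(predicted, reference):
    """Relative error in percent, measured against the reference value."""
    return abs(predicted - reference) / reference * 100.0

# Values copied from Table 3 (two columns of D_0).
reference = [1.6052, 1.6301]   # spectral reference values
svd = [1.5107, 1.5243]         # predictions from the SVD basis
srgan = [1.6234, 1.7120]       # predictions from the SRGAN assisted basis

svd_err = [round(relative_error(p, r), 1) for p, r in zip(svd, reference)]
srgan_err = [round(relative_error(p, r), 1) for p, r in zip(srgan, reference)]
print(svd_err)    # [5.9, 6.5]
print(srgan_err)  # [1.1, 5.0]
```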


4 Conclusions 


Based on snapshots at two molecular diffusivity values, we trained an adapted
super-resolution deep neural network (SRGAN) to model the internal layer
sharpening effect of the advection-diffusion equation as a nonlinear mapping. The
mapping improves the quality of the standard POD basis for low-cost computation
of residual diffusivity in chaotic flows. Though no other DNN model is known
to assist POD in our setting, we shall explore how to improve the fidelity of the
generated images in the current model.
Abstract. Predicting stock prices is a challenging task due to the highly
stochastic nature of the financial market. Among the many quantitative
approaches proposed to tackle this problem, machine learning has in recent
years become one of the most promising. However, machine learning is still
new to a large part of the Vietnamese investor community. This motivated us
to take some first steps in applying machine learning techniques to Vietnamese
stock data, in particular the top 20 listed stocks (by market capitalization)
of the VN-Index in June 2019. The experimental results suggest that machine
learning and hybrid methods forecast stock price fluctuations better than
traditional methods such as the Autoregressive Integrated Moving-Average
model. To realize our study, we implement a web-based tool and release its
source code.


Keywords: Machine learning · ARIMA · ANN · Hybrid model ·
Vietnamese stock market


1 Introduction 


Investing in the stock exchange is one of the riskiest and toughest activities, as it
can result in a complete loss for investors. To reap high profits, investors
should employ increasingly efficient investment strategies. These strategies are
based on three trading schools of thought: fundamental analysis, technical
analysis, and quantitative technical analysis¹. Fundamental analysis involves
the examination of the economic factors, such as balance sheets and income
statements, that influence stock prices. Technical analysis uses different types
of indicators derived from the history of stock prices and volumes to predict
future values of the stock price. Finally, quantitative analysis², and machine
learning in particular, which we use in our study, is a group of techniques that
seek to understand stock market behavior by using mathematical and statistical
modeling, evaluation methods, and the studies in the literature.


¹ https://muse.union.edu/2019capstone-hladikl/methods-of-stock-market-prediction-2/.
© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 291-303, 2020.
https://doi.org/10.1007/978-3-030-38364-0_26

Over the past decades, machine learning has been developed not only in theory
but also in industrial applications. In time series forecasting, some common
machine learning methods that have been developed are Artificial Neural
Networks (ANN) and Genetic Algorithms (GA), see [9]. Another form of ANN that is
more appropriate for time series prediction is the Recurrent Neural Network (RNN),
see [12]. The use of hybrid models to enhance prediction accuracy can
be seen in [10,13].

In time series forecasting, Autoregressive Integrated Moving-Average 
(ARIMA), a time series analysis method, has been applied widely for some sorts 
of time series data. Adebiyi et al. [1] studied the application of this method on 
stock price data. The study was conducted on public stock data obtained from 
the New York Stock Exchange (NYSE) and the Nigeria Stock Exchange (NSE). 
The authors used the Box-Jenkins approach in [3] to narrow down the collection 
of orders of ARIMA models. The results revealed that an ARIMA model will 
potentially yield better results in predicting stock prices on a short-term basis. 

For investing tools, Kimoto et al. [8] presented a buying- and selling-time 
prediction system for stocks on the Tokyo Stock Exchange. The system operated 
in two stages combining Principal Component Analysis (PCA) and ANN. The 
study used PCA to select proper inputs from technical indicators to forecast the 
future values. The prediction system achieved highly accurate predictions and 
reaped excellent profit via the stock trading simulation. 

In 2003, Zhang [14] proposed a hybrid method combining an ARIMA model 
and an ANN model. The author first assumed that the method is capable of
taking advantage of the unique strengths of ARIMA and ANN models in linear
and non-linear modeling. In order to demonstrate the effectiveness of the hybrid
method, this study conducted experiments on three well-known datasets, including
Wolf’s sunspot data, the Canadian lynx data, and the British pound/US dollar 
exchange rate data. The experimental results showed that the combined method 
could be an effective way to improve the accuracy of prediction achieved by 
either of the models when used separately. 

The work of Atilla et al. [2] provided a performance comparison of various
combinations of ARIMA models, ANN models and Radial Basis Function Network
(RBFN) models. The dataset, consisting of the number of monthly tourist arrivals
to Turkey between January 1984 and December 2003, was used for this study.
The outcomes of this work indicated that the models with a non-linear component
gave a better forecasting performance.


² https://www.investopedia.com/terms/q/quantitativeanalysis.asp.
Regarding the prediction of Vietnamese stock behavior, there have also been a few
studies [5,11]. In [5], the author proposed a hybrid method combining Gaussian
Process Regression (GPR) and Autoregressive Moving Average (ARMA). Experiments
conducted on the close price of the VN-Index showed that each method used
separately performed worse than the two combined. In [11], the authors employed
the firefly algorithm [7] to train an Adaptive Neuro-Fuzzy Inference System
(ANFIS). The study also compared the performance of the proposed system with
that of systems that train ANFIS with a hybrid algorithm, a back-propagation
algorithm and a particle swarm optimization algorithm, on data of some stocks
of the Hanoi Stock Exchange (HNX). The experimental results indicated that the
proposed system outperforms the rest in terms of both performance and running
time.

Our contributions are as follows. Firstly, each method mentioned above has its
own advantages and disadvantages and was applied only to a particular dataset.
Therefore, in this paper, we conduct a comprehensive comparison of the
performance of three forecasting methods, namely ARIMA, ANN and the hybrid model
of ARIMA and ANN, on the Vietnamese stock market, in particular on historical
data of 20 stocks of the VN-Index from January 2016 to June 2019.

Secondly, we aim at building a user-friendly tool for Vietnamese investors.
Besides the main feature, which is to predict stock prices with the three models
used in this paper, we also add some useful features such as displaying real-time
data and plotting some indicators (moving average, Bollinger bands, etc.). Figure 4
shows the graphical user interface of our application. The details of this tool will
be described in the last part of Subsect. 3.4.

The rest of our paper is organized as follows. In Sect. 2, we briefly recall
the basic concepts of ARIMA, ANN and the hybrid model of ARIMA and
ANN. In Sect. 3, we describe our data preparation, prediction procedure and
the results of the empirical experiments. Finally, Sect. 4 summarizes the paper and
envisions some future development directions.


2 Methodology 


In this section, we briefly recall some of the fundamental concepts that will be 
used in subsequent sections. 


2.1 Autoregressive Integrated Moving-Average 


In an Autoregressive Integrated Moving-Average model, future values are linearly
dependent on several past values and random errors. This model includes
three components: Autoregressive (AR), Integrated (I) and Moving-Average (MA).
The underlying process generating the time series is written in the form

$$x_t = \alpha_1 x_{t-1} + \alpha_2 x_{t-2} + \cdots + \alpha_p x_{t-p} + \varepsilon_t + \beta_1 \varepsilon_{t-1} + \beta_2 \varepsilon_{t-2} + \cdots + \beta_q \varepsilon_{t-q},$$

where $x_t$ and $\varepsilon_t$ are the actual value and the error at time $t$, respectively;
$\{\alpha_i\}_{i=1,\ldots,p}$ and $\{\beta_j\}_{j=1,\ldots,q}$ are the coefficients (model parameters); $p$ and $q$ are
integers usually referred to as the orders of the model (hyperparameters). The
random errors $\varepsilon_t$ are assumed to be independently and identically distributed
with zero mean and a constant variance $\sigma^2$. Furthermore, the I component
is the number of times the time series is differenced in order to
obtain a stationary time series. The order of this component is denoted by $d$.

Stationarity is a necessary condition for building an ARIMA model that is
statistically valid [3]. A time series is called stationary when its statistical
characteristics, such as mean and autocorrelation structure, remain constant
over time, or at least in the long term. When the observed time series shows a trend
and heteroscedasticity, differencing and a log transformation are applied to the
time series in order to remove the trend and stabilize the variance before fitting
the ARIMA model to the data.
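To make the ARMA recursion and the stationarity transforms of this subsection concrete, here is a minimal sketch in plain Python. The helper names `log_difference` and `arma_one_step` are illustrative, the coefficients are supplied by the caller rather than estimated, and the unknown future error is replaced by its zero mean.

```python
import math

def log_difference(series, d=1):
    """Apply a log transform, then difference the series d times."""
    out = [math.log(v) for v in series]
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def arma_one_step(x_hist, eps_hist, alpha, beta):
    """One-step-ahead ARMA(p, q) forecast from past values and past errors.

    x_hist and eps_hist are ordered oldest to newest; alpha[0] multiplies the
    most recent value and beta[0] the most recent error.
    """
    ar = sum(a * x for a, x in zip(alpha, reversed(x_hist)))
    ma = sum(b * e for b, e in zip(beta, reversed(eps_hist)))
    return ar + ma

# Example: p = 2, q = 1 with made-up coefficients.
forecast = arma_one_step([1.0, 2.0], [0.5], alpha=[0.5, 0.25], beta=[0.4])
```

In practice a statistics library would estimate the coefficients; the sketch only shows how the terms of the equation combine once they are known.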

When the ARIMA model is determined, which means that a set of orders is 
identified, the model parameters will be estimated. The estimation is conducted 
by minimizing an overall measure of errors. This can be done by iterative opti- 
mization procedures. 

There are several candidate sets of orders for ARIMA models that should be listed.
The method used to specify a collection of sets of orders will be discussed
in Sect. 3. Then, by validating the ARIMA models with respect to their sets of
orders, the best model will be selected for forecasting purposes.


2.2 Artificial Neural Network 


Multilayer Neural Network. A multilayer neural network is a form of ANN
containing perceptrons. These perceptrons are organized into layers. A
multilayer neural network architecture includes at least three layers: one input
layer, one output layer, and one or more hidden layers. We will discuss how we
specify the architecture of a neural network in detail in Sect. 3.


Activation Function. An activation function is a non-linear function that
helps the network approximate non-linear functions. In our experiments, we
use the Tanh function $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ as the activation function.


Backpropagation. Backpropagation is a method used in ANNs to calculate the
gradient that is needed to adjust the weights of the network [4]. The weights
are updated so as to minimize the error; one commonly used algorithm for this
is gradient descent.
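The interplay of the Tanh activation, the gradient computed by backpropagation and a gradient descent update can be illustrated on a single neuron. This is a toy sketch under simplifying assumptions (one input, one weight, squared-error loss, a hypothetical learning rate `lr`), not the network used in the experiments.

```python
import math

def tanh(x):
    """Tanh activation written out from its definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def loss_and_grads(w, b, x, y):
    """Squared error of a single tanh neuron and its gradients w.r.t. w and b."""
    out = tanh(w * x + b)
    err = out - y
    # Chain rule: d(err^2)/dz = 2 * err * (1 - tanh(z)^2).
    dz = 2.0 * err * (1.0 - out ** 2)
    return err ** 2, dz * x, dz

def gradient_step(w, b, x, y, lr=0.1):
    """One gradient descent update of the weight and the bias."""
    _, grad_w, grad_b = loss_and_grads(w, b, x, y)
    return w - lr * grad_w, b - lr * grad_b

loss_before, _, _ = loss_and_grads(0.0, 0.0, 1.0, 0.5)
w_new, b_new = gradient_step(0.0, 0.0, 1.0, 0.5)
loss_after, _, _ = loss_and_grads(w_new, b_new, 1.0, 0.5)
```

A single update already lowers the error in this example, which is the behavior gradient descent relies on when repeated over many samples.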


2.3 Hybrid Model of ARIMA and ANN


The hybrid approach combines the advantages of both the ARIMA model and the
ANN model, producing better performance on some sorts of time series data,
see [14]. The main idea of this method comes from two perspectives. First, it is
difficult to determine whether a time series is generated by a linear or a
non-linear underlying process. Second, practical time series are rarely purely
linear or purely non-linear. Therefore, it is assumed that every time series
contains two components: a linear component and a non-linear component. A time
series $\{x_t\}_{t=1,\ldots,T}$ composed of a linear autocorrelation structure and a
non-linear component is written as

$$x_t = L_t + N_t + \varepsilon_t,$$

where $L_t$ represents the linear part, $N_t$ represents the non-linear part and $\varepsilon_t$
is a random error. First, the ARIMA model is applied to capture the linear
part, so that the residuals from the linear model contain only non-linear
relationships. Let $e_t$ denote the residual at time $t$, then

$$e_t = x_t - \hat{L}_t,$$

where $\hat{L}_t$ is the forecast value at time $t$ from the ARIMA model. Now, the
residuals are modeled by the ANN model with $n$ inputs,

$$e_t = f(e_{t-1}, e_{t-2}, \ldots, e_{t-n}) + \varepsilon_t,$$

where $f$ is the non-linear function determined by the ANN model. We
should notice that “if the function f is not an appropriate one, the error term is
not necessarily random. Therefore, the correct model identification is critical”.
Denoting by $\hat{N}_t$ the forecast value from $f$, the predicted value is

$$\hat{x}_t = \hat{L}_t + \hat{N}_t.$$

The hybrid model exploits the unique features and combines the strengths of
both ARIMA and ANN models. However, this method is not universal in
real-life problems [14].
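The two-stage structure of the hybrid method (a linear fit, a residual model, and the sum of both forecasts) can be sketched with deliberately simple stand-ins. In the snippet below a least-squares AR(1) fit replaces the ARIMA stage and a 1-nearest-neighbour regressor on the previous residual replaces the ANN stage; both substitutions, and all function names, are assumptions made only to keep the example self-contained.

```python
def fit_ar1(x):
    """Least-squares AR(1) coefficient without intercept (stand-in for ARIMA)."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(a * a for a in x[:-1])
    return num / den

def fit_nearest_neighbour(lags, targets):
    """1-nearest-neighbour regressor on one lagged residual (stand-in for the ANN)."""
    def predict(lag):
        i = min(range(len(lags)), key=lambda k: abs(lags[k] - lag))
        return targets[i]
    return predict

def hybrid_forecast(x):
    """Return (x_hat, L_hat, N_hat) for the step after the last observation."""
    a = fit_ar1(x)
    linear_fit = [a * prev for prev in x[:-1]]              # linear forecasts for x[1:]
    resid = [xt - lt for xt, lt in zip(x[1:], linear_fit)]  # e_t = x_t - L_hat_t
    f = fit_nearest_neighbour(resid[:-1], resid[1:])        # e_t modeled from e_{t-1}
    l_next = a * x[-1]       # linear part of the next forecast
    n_next = f(resid[-1])    # non-linear part predicted from the last residual
    return l_next + n_next, l_next, n_next

total, linear_part, nonlinear_part = hybrid_forecast([1.0, 2.0, 4.0, 8.0, 16.0])
```

For this purely linear toy series the residuals vanish, so the non-linear correction is zero and the forecast equals the linear one; on real data the second stage would contribute whatever structure the linear model leaves behind.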


3 Empirical Results 


3.1 Data Collection 


To conduct our experiments, stock price data were collected from the
website investing.com³, a global financial portal and internet brand
owned by Fusion Media Limited. We collected data from this site because we
found that its AJAX requests for historical data are easy to mimic.
To crawl data from the Investing website, we used the Selenium framework to get
the HTML contents, and then parsed them using BeautifulSoup 4.
The obtained dataset is organized as shown in Table 1. We collected historical
stock price data of the top 30 stocks of the VN-Index from January 2016 to June 2019.
Readers can find the public dataset on our GitHub⁴. However, in the scope of
this paper, we only use the top 20 stocks listed in Table 4 to conduct experiments.


³ https://www.investing.com/indices/vn-30-components.
⁴ https://github.com/thanhphuong163/vn_stock_prediction/tree/master/Data.
Table 1. An example of historical data in database.

date | name | ticket | index | close | open | high | low | volume | change_per
22-05-2019 | Cotec Construction JSC | CTD | VN 30 (VNI30) | 115300 | 115100 | 116000 | 115100 | 379500 | -0.35
22-05-2019 | Mobile World Investment Corp | MWG | VN 30 (VNI30) | 87000 | 87000 | 87983 | 86705 | 6731400 | 0.34
22-05-2019 | Phu Nhuan Jewelry JSC | PNJ | VN 30 (VNI30) | 107400 | 106400 | 109000 | 106400 | 6358500 | 0.94
22-05-2019 | No Va Land Investment Group | NVL | VN 30 (VNI30) | 57700 | 59000 | 59200 | 57700 | 3901800 | -2.2


3.2 Data Preprocessing 


To feed data to machine learning models, such as an ANN, it is necessary to 
transform time series datasets into supervised learning datasets. 

The key function used in this paper for transforming a time series dataset into a
supervised learning dataset is the shift() function of the Pandas framework.
This function can create copies of a time series which are pushed forward (nan
values are put at the front) or pulled backward (nan values are put at
the end). Copies of the time series which are pushed forward become columns of
lag observations. These columns are considered as input patterns. Meanwhile,
copies that are pulled backward become columns of forecast observations, also
known as output patterns.

Table 2 illustrates an example of the transformation from a time series dataset
into a supervised learning dataset. We have a time series in the left table, which
we re-frame into supervised learning data in the right table. The columns
lag_1 and lag_2 are copies of the original time series which were pushed forward.
They represent the price in the past. The column forecast_0 is the original
time series and represents the price at present. The columns lag_1, lag_2 and
forecast_0 are considered as input features of a machine learning model. Finally,
the column forecast_1 contains the forecast observations, or output.
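The re-framing described above can be sketched without any dependency by mimicking the pushing and pulling behavior of a shift operation on a plain list. The helpers `shift` and `to_supervised` are illustrative names, and `None` plays the role of the nan values; with Pandas, the same columns would come from calling `shift(k)` and `shift(-k)` on a series.

```python
def shift(values, periods):
    """Push a list forward (periods > 0) or pull it backward (periods < 0),
    filling the vacated positions with None."""
    n = len(values)
    if periods >= 0:
        return [None] * min(periods, n) + values[: max(n - periods, 0)]
    return values[-periods:] + [None] * min(-periods, n)

def to_supervised(values, n_lags=2, n_ahead=1):
    """Build lag_k (input) and forecast_k (present/output) columns."""
    columns = {}
    for k in range(n_lags, 0, -1):
        columns[f"lag_{k}"] = shift(values, k)
    for k in range(0, n_ahead + 1):
        columns[f"forecast_{k}"] = shift(values, -k)
    return columns

# Prices in the order in which they are listed in Table 2.
prices = [104300.0, 106300.0, 106800.0, 106500.0, 107000.0, 107000.0, 108000.0]
frame = to_supervised(prices)
```

Rows that contain `None` (the first two and the last one here) would typically be dropped before training.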


Table 2. Converting time series data into supervised learning data.

Date | Value | Date | lag_2 | lag_1 | forecast_0 | forecast_1
2019-06-14 | 104300.0 | 2019-06-14 | nan | nan | 104300.0 | 106300.0
2019-06-13 | 106300.0 | 2019-06-13 | nan | 104300.0 | 106300.0 | 106800.0
2019-06-12 | 106800.0 | 2019-06-12 | 104300.0 | 106300.0 | 106800.0 | 106500.0
2019-06-11 | 106500.0 | 2019-06-11 | 106300.0 | 106800.0 | 106500.0 | 107000.0
2019-06-10 | 107000.0 | 2019-06-10 | 106800.0 | 106500.0 | 107000.0 | 107000.0
2019-06-07 | 107000.0 | 2019-06-07 | 106500.0 | 107000.0 | 107000.0 | 108000.0
2019-06-06 | 108000.0 | 2019-06-06 | 107000.0 | 107000.0 | 108000.0 | nan


3.3 Evaluation Method


Mean Absolute Percentage Error. Mean absolute percentage error (MAPE)
is a measure of the prediction accuracy of forecasting models. It is often used as
a loss function for regression problems in machine learning. This measure is
computed by the following formula

$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{\hat{y}_t - y_t}{y_t} \right| ,$$

where $y_t$ is the actual price and $\hat{y}_t$ is the predicted price. The error is calculated
by taking the mean of the absolute differences between $\hat{y}_t$ and $y_t$, each divided by
$y_t$, and multiplying by 100 to make it a percentage error.
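The MAPE of this subsection translates directly into code. The following is a small sketch; the function name `mape` and the sample prices are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((p - a) / a) for a, p in zip(actual, predicted))

# Two made-up prices: errors of 10% and 5% average to 7.5%.
example = mape([100.0, 200.0], [110.0, 190.0])
```

Because each error is divided by the actual price, the measure is undefined when an actual value is zero, which does not occur for stock prices.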


3.4 Time Series Cross-Validation Procedure 


Time series cross-validation is a more sophisticated version of cross-validation [6].
To conduct this type of cross-validation, the data are organized into a series of
training/test set splits, as depicted in Fig. 1. For each training/test set split, we
fit one model per set of hyperparameters in a collection to the training set. The
fitted models are used to predict the price of the next day, which corresponds to
the price in the test set. Once we obtain the predictions, we compute the error of
each prediction and choose the least value among these errors. The same procedure
is applied to the rest of the training/test set splits. The total error of a model is
the mean of these least errors over the series of training/test set splits.
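The procedure can be sketched as a double loop over splits and hyperparameter sets. The snippet assumes an expanding training window with a one-day test set, and it uses a hypothetical moving-average forecaster (`mean_forecaster`, parameterized by a window length) purely as a stand-in for the ARIMA, ANN and hybrid models.

```python
def mean_forecaster(window):
    """Stand-in model: predict the mean of the last `window` training values."""
    def predict(train):
        recent = train[-window:]
        return sum(recent) / len(recent)
    return predict

def time_series_cv(series, hyperparameter_sets, make_model, min_train=3):
    """Mean over all splits of the least absolute percentage error per split."""
    least_errors = []
    for end in range(min_train, len(series)):
        train, actual = series[:end], series[end]   # next-day price is the test set
        errors = []
        for params in hyperparameter_sets:
            prediction = make_model(params)(train)  # fit and predict for this set
            errors.append(abs((prediction - actual) / actual) * 100.0)
        least_errors.append(min(errors))            # keep the best set for this split
    return sum(least_errors) / len(least_errors)

total_error = time_series_cv([1.0, 2.0, 3.0, 4.0, 5.0], [1, 2], mean_forecaster)
```

Each split only uses observations that precede the test day, so no future information leaks into the fitted model.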

In the rest of this subsection, we describe how we specify the collection of
hyperparameters for each model used in our experiments.


Validation of Autoregressive Integrated Moving-Average. We use two
techniques, trial-and-error and grid search, to determine the model which yields
the best performance. An ARIMA model is specified by a set of hyperparameters
$(p, d, q)$, where $p$ is the order of the AR component, $d$ is the order of the I
component, and $q$ is the order of the MA component. By using the trial-and-error
technique, we observed that the orders of the AR and MA components of the ARIMA
model usually fall into the ranges [0,3] and [0,6], respectively. Furthermore, the
larger the order, the longer the training time.
[Flowchart. For each train/test split of the available data: fit model, predict
price, compute error; while there are unused sets of hyperparameters, tune the
hyperparameters and repeat; otherwise choose the least error (hyperparameter
selection).]

Fig. 1. The illustration of time series cross-validation.


Validation of Artificial Neural Network. With an ANN model, we use
the trial-and-error approach to determine the proper architecture for fitting the
model. After many trials, we find that ANN models with two hidden layers
produce better results than others, and that the number of units in each layer is
approximately 10. Another observation is that the models with one to
three input units yield better results than the ones using a larger number of
input units. From these observations, we employ the grid-search technique to
seek the best model, with the number of units in the first hidden layer ranging
from three to seven and the number of units in the second layer from one to three.
An ANN model is specified by a set of hyperparameters $(i, k, l)$, where $i$, $k$, $l$ are
the numbers of units in the input, the first hidden and the second hidden layer;
the output layer consists of only one unit.
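The search spaces described in this subsection can be enumerated with `itertools.product`. In the sketch below the ranges for $p$, $q$ and for the two hidden layers follow the text; the range of $d$ (0 to 2) and the inclusion of one to three input units in the ANN grid are assumptions made for illustration.

```python
from itertools import product

# ARIMA orders (p, d, q): p in [0, 3] and q in [0, 6] as observed by trial and error;
# the range of d is an assumption, it is not stated in the text.
arima_grid = list(product(range(0, 4), range(0, 3), range(0, 7)))

# ANN architectures (i, k, l): input units 1..3 (assumed to be part of the grid),
# first hidden layer 3..7 units, second hidden layer 1..3 units.
ann_grid = list(product(range(1, 4), range(3, 8), range(1, 4)))

print(len(arima_grid), len(ann_grid))  # 84 45
```

Every tuple in these lists corresponds to one model fitted per training/test split, which is why restricting the ranges matters for the running time.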


Validation and Implementation of Hybrid Model of ARIMA and
ANN. Firstly, we fit ARIMA models to the data and select the best model
for the linear component of the time series. Next, we compute the
residuals (the non-linear component of the time series) on the training set, which
are then fitted by ANN models. To find the best ANN model, we again employ the
grid-search technique with the same setup as in the second experiment (the one
that used an ANN model). To the best of our knowledge, this model has no public
source code or built-in implementation in any framework. Therefore, we implement
this model ourselves based on the description of Zhang [14]; the implementation⁵
follows the flow described in the diagram in Fig. 2.


3.5 Results and Analysis 


To illustrate the numerical results in this paper, we use the close prices of
Phu Nhuan Jewelry JSC. Figure 3 shows the differences between the predicted
prices and the actual prices for the three models.


[Diagram: Historical data → ARIMA model → Predicted values and Residuals;
Residuals → ANN model → Predicted values; both predicted values are summed
to give the Predicted price.]

Fig. 2. Diagram of the hybrid ARIMA+ANN model.


In Fig. 3, the red solid line shows the actual prices, the blue dash-dot line shows
the one-step predicted prices of the ARIMA model, the yellow dashed line shows
the one-step predicted prices of the ANN model, and the green dotted line shows
the one-step predicted prices of the hybrid model of ARIMA and ANN.


[Plot: price series for actual, ARIMA, ANN and Hybrid; vertical axis from 74000
to 82000.]

Fig. 3. Model comparison of three models on Phu Nhuan Jewelry JSC.


⁵ The private repository of the tool is located at
https://github.com/thanhphuong163/vn_stock_prediction/tree/master/Models.
Access to this repository is granted upon request.
In the figure, the predicted values of the ANN model and the hybrid model
seem to be closer to the actual values than those of the ARIMA model. Indeed,
Table 3 tells us that the hybrid model is the best fitting model for this stock.

Table 3 shows the MAPE of each model applied to the top 20 stocks of the
VN-Index. The best MAPE among the three models is highlighted for each stock.
ANN models perform best on nine stocks, and the hybrid models perform best on
the other 11 stocks, so on every one of the 20 stocks either the ANN or the
hybrid model achieves the lowest MAPE. This implies that machine learning and
hybrid methods produce more accurate predictions on Vietnamese stock data than
ARIMA, a traditional time series analysis method.

Furthermore, next to each MAPE is the range of absolute percentage errors of
each model. This range consists of the minimum and maximum errors over the process
of time series cross-validation. The ranges of the ANN and the hybrid models are
contained in the ranges of the ARIMA models. This result implies that ANN and
hybrid models give more stable performance than ARIMA models.

The works most closely related to our study are the ones by Atilla et al. [2],
Huynh Quyet Thang [5] and Nguyen Nhu Hien et al. [11]. Atilla et al. used the
same models as our study. However, these models were applied to a different
dataset. In [5,11], the authors studied the Vietnamese stock market, but they
examined their models on the VN-Index and some stocks of HNX. Moreover, the
authors used Root Mean Squared Error (RMSE), while we used MAPE as
an evaluation measure. Therefore, we draw no comparison or conclusion
between our study and the above-mentioned papers. However, since MAPE is a
relative value (percentage), we can compare results of different stocks without
regard to the unit.


Table 3. MAPE of 20 listed stocks of VN-Index (%), with the range [min, max] of
absolute percentage errors. The best MAPE for each stock is marked with *.

Ticket | ARIMA | ANN | Hybrid
CTD | 0.5857 [0.0, 3.6516] | 0.1604 [0.0016, 2.0693] | 0.1198* [0.0031, 0.8525]
DHG | 0.7067 [0.0, 2.2226] | 0.0957 [0.0034, 0.4585] | 0.0852* [0.0004, 0.7587]
ROS | 1.1189 [0.0, 5.8533] | 0.2524 [0.0008, 1.9342] | 0.1841* [0.0034, 1.5147]
FPT | 0.7411 [0.0, 2.2657] | 0.0961 [0.0013, 0.6104] | 0.0731* [0.0015, 0.4405]
GMD | 0.4653 [0.0, 3.252] | 0.087* [0.0011, 0.689] | 0.1001 [0.0028, 0.5355]
CII | 0.479 [0.0, 2.2832] | 0.0666* [0.0022, 0.2758] | 0.0841 [0.0013, 1.0391]
HDB | 0.2801 [0.0, 1.2336] | 0.0559* [0.0, 0.2153] | 0.0649 [0.0003, 0.3026]
HPG | 0.6342 [0.0, 4.4476] | 0.1022* [0.0006, 0.6058] | 0.1228 [0.0019, 0.4752]
MSN | 0.6406 [0.0, 3.4088] | 0.0831* [0.0016, 0.6518] | 0.1126 [0.0003, 0.6386]
MBB | 0.5 [0.0, 2.4699] | 0.1313 [0.0048, 0.7402] | 0.107* [0.0, 1.0252]
MWG | 0.6174 [0.0, 2.3114] | 0.1425 [0.0016, 0.9554] | 0.0629* [0.0008, 0.3441]
NVL | 0.452 [0.0, 1.9389] | 0.1137* [0.0017, 1.0121] | 0.1379 [0.0015, 1.1799]
DPM | 0.8825 [0.0, 5.3179] | 0.1359* [0.0006, 0.9245] | 0.1433 [0.0034, 0.5912]
PNJ | 0.8613 [0.0, 4.9514] | 0.1917 [0.0074, 1.423] | 0.1578* [0.0017, 1.5642]
REE | 0.5697 [0.0, 3.0789] | 0.0972 [0.0003, 0.9905] | 0.0921* [0.0022, 0.4002]
STB | 0.5098 [0.0, 3.2084] | 0.0835 [0.0002, 0.5221] | 0.0785* [0.0016, 0.6935]
SAB | 0.8686 [0.0, 3.6254] | 0.1832 [0.0041, 1.3722] | 0.1246* [0.0015, 0.7896]
SSI | 0.5672 [0.0, 2.591] | 0.1304 [0.004, 0.8443] | 0.0903* [0.0015, 0.6173]
TCB | 0.6748 [0.0, 4.7885] | 0.0644* [0.0005, 0.2711] | 0.2471 [0.0024, 0.8618]
SBT | 0.5261 [0.0, 6.1373] | 0.0863* [0.0002, 0.6595] | 0.105 [0.0016, 0.8092]
4 Conclusion


Stock price prediction has been an active research area over the last few decades.
The accuracy of forecasts plays an important role in any investor’s
decision. In the Vietnamese stock market, where most investors still use traditional
methods, or even their intuition, in stock trading, our study has demonstrated the
effectiveness of machine learning techniques in stock price prediction. By
experiments, we have shown that machine learning and hybrid methods such as
ARIMA+ANN outperformed traditional methods such as ARIMA on the Vietnamese
stock market. This suggests that many Vietnamese stocks have both linear and
non-linear underlying structures in their historical data. Moreover, machine
learning also gives more stable forecasting results than traditional time series
analysis methods.

Nonetheless, the dynamics of stock prices in the market are influenced not only by
their own past, but also by a variety of factors such as news, financial indicators,
political conditions, human psychology, etc. Machine learning is an approach
which can combine all possible factors to improve the performance of stock price
prediction. Therefore, using market factors as input features for machine learning
models might be a promising research direction.


Acknowledgements. The authors would like to thank Nguyen Van Thanh and 
Nguyen Huynh Huy for their comments helping to improve the manuscript of this 
work significantly. 


A Appendix 


A.1 List of Top 20 Stocks of VN-Index 


Table 4. List of top 20 stocks of VN-Index.

Symbol | Name of company
CTD | Cotec Construction JSC
DHG | DHG Pharmaceutical JSC
ROS | Faros Construction Corp
FPT | FPT Corp
GMD | Gemadept Corp
CII | Ho Chi Minh City Infrastructure Investment JSC
HDB | Ho Chi Minh City Development Joint Stock Commercial Bank
HPG | Hoa Phat Group JSC
MSN | Masan Group Corp
MBB | Military Commercial Joint Stock Bank
MWG | Mobile World Investment Corp
NVL | No Va Land Investment Group Corp
DPM | Petro Vietnam Fertilizer and Chemicals Corp
PNJ | Phu Nhuan Jewelry JSC
REE | Refrigeration Electrical Engineering Corp
STB | Sai Gon Thuong Tin Commercial Joint Stock Bank
SAB | Saigon Beer Alcohol Beverage Corp
SSI | Saigon Securities Incorporation
TCB | Vietnam Technological and Commercial Joint Stock Bank
SBT | Thanh Thanh Cong Tay Ninh JSC


A.2 Application Interface 


Figure 4 shows the interface of the tool mentioned in Footnote 5. 


[Screenshot of the application window titled "STOCK PRICE PREDICTION".]

Fig. 4. Graphical user interface of the application.
Abstract. The furniture industry in Indonesia is growing rapidly, with exports
reaching US$ 1.627 billion in 2017. Given the massive potential
of the furniture industry in Indonesia, a method to increase its productivity and
efficiency is indispensable. Group technology in cell manufacturing is used as an
approach to achieve this objective by improving the production layout. The Rank
Order Clustering (ROC) and Hollier methods are used to classify machines
and processes in the production layouts. Discrete Event Simulation (DES), used as
a validation tool, shows that the layout improvement reduces the
total travel time by 24.6% and the total travel distance by up to 16.1%.


Keywords: Group Technology · Rank Order Clustering · Furniture industry


1 Introduction 


The furniture industry in Indonesia plays an important role and has become one of the
leading export contributors. In 2016, the furniture industry in Indonesia reached up to
US$ 1.6 billion in exports. The number then increased in 2017, reaching US$ 1.627
billion. Based on its sector clustering, the furniture industry can be classified as a craft
sector. The craft sector ranked as the second-highest export contributor in 2015 and
2016, with up to 39.01% of the total. However, the problems found on the furniture
production floor are an obstacle that should not be neglected.

The problems found on the furniture production floor include low coordination
between workstations, out-of-date information on the current status of finished
parts and work-in-progress parts, duplicated and unnecessary quantities of some
parts in the production line, and irregular and inefficient part flow and
transport. These problems could be reduced by minimizing the part transport time,
grouping parts into families based on process similarity, and forming Cell
Manufacturing (CM) [1]. CM is part of Group Technology (GT), which aims to
increase productivity and efficiency with regard to flow time and production cost
[2]. GT is suitable for job shop production systems, namely those that produce a
high variety of products in relatively small quantities [3].

The grouping needed to build the CM layout can be done with several methods, such
as heuristic and clustering approaches [4-6]. CM implementations carry several
benefits, such as reducing non-value-added time and increasing group efficacy
[7, 8].


© Springer Nature Switzerland AG 2020 
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 304-310, 2020. 
https://doi.org/10.1007/978-3-030-38364-0_27 
2 Methods


Clustering Method. To form the CM, the Rank Order Clustering (ROC) method is
applied. First, four furniture products, named SN-5, SN-6, SN-7, and SN-8, are
taken as sample models. In general, the Operation Process Chart (OPC) for making
the furniture products consists of six stages: crosscutting, spindling,
mortising, pinning, drilling, and profiling. The furniture products are built
from forty-three different parts, which are clustered using Production Flow
Analysis (PFA) into 14 part groups. The details of the 14 part groups are shown
in Table 1.


Table 1. Part clustering. 


Cluster | Parts name
(GN, GDB/SB, SD, SDB, SRP) SN-6; (SLTB, GDB) SN-7; (SS) SN-5
(TGN) SN-6; (TGN) SN-7; (TGN) SN-8
(SDB, SDD) SN-8
(GN/SB) SN-5
(PLD) SN-5
(GN) SN-7
(FSDR, FDS) SN-6; (SS) SN-7; (FSDR, SS) SN-8
(SLTB, SRP, SD) SN-5; (SRP) SN-7; (SLTB, GBD, SRP) SN-8
(KB) SN-5; (KB, KD) SN-6; (KB, KD) SN-7; (KB, KD) SN-8
(GN, SB) SN-8; (TGN) SN-5
(KD) SN-5
(SB, SDB/SDD) SN-7
(FSDR) SN-7
(GBD) SN-5


Second, ROC is used to form the CM groups from the 14 part groups. Three
iterations generate the new parts/machines matrix shown in Table 2. The grey area
indicates the CM clustering groups. CM 1 consists of I, A, K, F, and D, while
CM 2 comprises M, E, C, G, N, L, B, J, and H.

From Table 2, two CM groups are formed. However, fifteen exceptional elements
are found that are not included in the CM clustering groups. Hence, the
production processes of these 15 exceptional elements are conducted outside the
CM clustering groups.
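
The ROC iteration described above can be sketched as follows. This is a minimal
illustration, not the authors' implementation: the 4 x 5 incidence matrix, the
part-group labels A-D and the machine labels M1-M5 are hypothetical and are not
taken from Table 2. Each row is read as a binary number and rows are sorted in
decreasing order, then the same is done for columns, until the ordering no longer
changes.

```python
def rank_order_clustering(matrix, row_labels, col_labels, max_iter=20):
    """Reorder a binary part/machine incidence matrix with Rank Order Clustering."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    for _ in range(max_iter):
        # Read each row as a binary word (leftmost column is the most
        # significant bit) and sort rows by decreasing value.
        new_rows = sorted(rows, key=lambda r: [matrix[r][c] for c in cols],
                          reverse=True)
        # Do the same for columns, reading each column from top to bottom.
        new_cols = sorted(cols, key=lambda c: [matrix[r][c] for r in new_rows],
                          reverse=True)
        if new_rows == rows and new_cols == cols:
            break  # ordering is stable, so the block structure is final
        rows, cols = new_rows, new_cols
    ordered = [[matrix[r][c] for c in cols] for r in rows]
    return [row_labels[r] for r in rows], [col_labels[c] for c in cols], ordered


# Hypothetical incidence matrix: rows are part groups, columns are machines.
incidence = [
    [1, 0, 1, 0, 1],  # A
    [0, 1, 0, 1, 0],  # B
    [1, 0, 1, 0, 0],  # C
    [0, 1, 0, 1, 0],  # D
]
part_order, machine_order, ordered = rank_order_clustering(
    incidence, ["A", "B", "C", "D"], ["M1", "M2", "M3", "M4", "M5"]
)
for label, row in zip(part_order, ordered):
    print(label, row)
```

In this toy case the ones collect into two diagonal blocks, which play the role
of the grey CM areas in Table 2. Any one left outside the blocks would be an
exceptional element.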


The Machine Sequencing. Once the CM groups are identified, the machine sequence
is determined so as to minimize the Backtracking Moves (BTM). The Hollier method
is used to specify the part flow efficiency in both CM 1 and CM 2. The part flow
between machines in CM 1 is shown in Table 3, followed by its flow diagram in
Fig. 1. In addition, Table 4 presents the part flow between machines in CM 2, and
Fig. 2 illustrates its flow diagram.

Table 2. Iteration matrix of parts/machines.

Table 3. Cell manufacturing 1 within machines flow part.

From/To | Crosscut 1 | Profil 1 | Spindle 1 | Mortis | Bor 2 | Tenon 1 | From sums
Crosscut 1 | - | - | 25800 | 300 | 5200 | 10000 | 41300
Profil 1 | - | - | - | - | - | - | 0
Spindle 1 | 13200 | - | - | 1000 | 2800 | 22000 | 39000
Mortis | - | 9000 | - | - | 2300 | - | 11300
Bor 2 | - | 2300 | - | 8000 | - | - | 10300
Tenon 1 | - | 30000 | - | 2000 | - | - | 32000
To sums | 13200 | 41300 | 25800 | 11300 | 10300 | 32000 |
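
As an illustration of how a machine sequence can be derived from such a From/To
chart, the sketch below applies the From/To ratio variant of the Hollier method
to the From sums and To sums of Table 3. The text does not state which Hollier
variant was used, so the choice of the ratio variant is an assumption, and the
function name is hypothetical. Machines with a high ratio mostly send parts and
are placed early, while machines with a low ratio mostly receive parts and are
placed late.

```python
# From sums and To sums as listed in Table 3 (cell manufacturing 1).
from_sums = {"Crosscut 1": 41300, "Profil 1": 0, "Spindle 1": 39000,
             "Mortis": 11300, "Bor 2": 10300, "Tenon 1": 32000}
to_sums = {"Crosscut 1": 13200, "Profil 1": 41300, "Spindle 1": 25800,
           "Mortis": 11300, "Bor 2": 10300, "Tenon 1": 32000}


def hollier_ratio_sequence(from_sums, to_sums):
    """Order machines by decreasing From/To ratio (assumed Hollier variant)."""
    def ratio(machine):
        if to_sums[machine] == 0:
            # A machine that receives nothing is treated as a pure source.
            return float("inf")
        return from_sums[machine] / to_sums[machine]
    # sorted() is stable, so tied machines keep their input order.
    return sorted(from_sums, key=ratio, reverse=True)


sequence = hollier_ratio_sequence(from_sums, to_sums)
print(" -> ".join(sequence))
```

Mortis, Bor 2 and Tenon 1 all have a ratio of exactly 1, so their relative order
is not determined by this rule alone and would have to be settled by inspecting
the individual flows.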


Layout Improvement. Following the CM grouping, the production layout is
improved. The improved layout is built from the part/machine matrix and the flow
diagrams. The previous production layout is shown in Fig. 3, followed by the
improved layout in Fig. 4.

Fig. 1. Cell manufacturing 1 flow diagrams.

Table 4. Cell manufacturing 2 within machines flow part.

From/To | Tenon 2 | Profil 2 | Bor 1 | Spindle 2 | Crosscut 2 | From sums
Tenon 2 | - | 21000 | 3400 | - | - | 24400
Profil 2 | - | - | - | - | - | 0
Bor 1 | 1000 | 10500 | - | 2700 | - | 14200
Spindle 2 | 900 | 2100 | 13400 | - | 1300 | 17700
Crosscut 2 | 23400 | - | - | 2900 | - | 26300
To sums | 25300 | 33600 | 16800 | 5600 | 1300 |

Simulation Model. A simulation model represents the real system and is built to
test problem solutions. In this study, the simulation model is used to verify the
efficiency gains of the production layout improvement. Company policy, production
layout, production resources, and the duration of each production process are
considered when simulating the model. Using the Flexsim software, the total
transport time and total travel distance of the previous layout and of the
improved layout are gathered from the simulation model.


Fig. 2. Cell manufacturing 2 flow diagrams.


3 Results


Fig. 4. Production layout after improvement. 


The simulation results for the improved and the non-improved layouts show a
significant contribution with regard to travel distance, time efficiency and cost
savings. Table 5 compares the actual layout and the improved layout in terms of
total travel distance and total transport time.

Table 5. Comparison of actual layout and improvement layout.

Model | Total travel distance (km) | Total transport time (working days)
Actual layout | 4317.301 | 138.50
Improvement layout | 3621.93 | 104.48


Table 5 indicates that the improved layout reduces the total travel distance by
16.1% (695.37 km). Meanwhile, the total transport time is reduced by 24.6%
(34.02 working days) after the improvement.
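
The two percentages can be checked directly from the values in Table 5. The short
sketch below assumes that the commas in the table are decimal separators; the
variable and function names are illustrative only.

```python
# Values from Table 5, reading the comma as a decimal separator (assumption).
actual = {"distance_km": 4317.301, "time_days": 138.50}
improved = {"distance_km": 3621.93, "time_days": 104.48}


def reduction(before, after):
    """Return the absolute saving and the saving as a percentage of `before`."""
    saved = before - after
    return saved, 100.0 * saved / before


dist_saved, dist_pct = reduction(actual["distance_km"], improved["distance_km"])
time_saved, time_pct = reduction(actual["time_days"], improved["time_days"])
print(f"distance: {dist_saved:.2f} km saved ({dist_pct:.1f}%)")
print(f"time: {time_saved:.2f} working days saved ({time_pct:.1f}%)")
```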

Additionally, the furniture company sets the minimum payment for a transport
worker at 27,500.00 Indonesian Rupiah per day. As the improved layout saves up to
34.02 working days, the company management could save 7,649,600.00 Indonesian
Rupiah.


4 Discussions and Conclusions 


The implementation of GT in the furniture industry is shown to benefit the
company, as indicated by the time efficiency and cost savings after the layout
improvement. Clustering methods such as ROC help to form the CM and to map the
flow diagrams in each cell. This study considers only similar furniture products.
Further investigation of different parts, product families and other industries
is needed to explore GT implementation. Furthermore, other grouping techniques
and algorithms for building the CM groups are worth exploring.
Abstract. Using machine learning techniques in financial markets, particularly
in stock trading, has attracted a lot of attention from both academia and
practitioners in recent years. Researchers have studied different supervised and
unsupervised learning techniques to either predict stock price movement or make
decisions in the market.

In this paper we study the usage of reinforcement learning techniques in stock
trading. We evaluate the approach on a real-world stock dataset. We compare the
deep reinforcement learning approach with state-of-the-art supervised deep
learning prediction on real-world data. Given the nature of the market, where the
true parameters will never be revealed, we believe that reinforcement learning
has a lot of potential in decision-making for stock trading.


Keywords: Reinforcement learning · Machine learning · Stock trading


1 Introduction 


Searching for an effective model to predict the prices of financial markets is
an active research topic today [13], even though research studies on it have been
published for a long time [3,11]. Within financial market prediction, stock price
prediction is considered one of the most difficult tasks [44]. Among the
state-of-the-art techniques, machine learning techniques have been the most
widely chosen in recent years, given the rapid development of the machine
learning community. Another reason is that traditional statistical learning
algorithms cannot cope with the non-stationarity and non-linearity of stock
markets [15].

In general, there exist two main approaches to analyzing and predicting stock
prices: technical analysis [23] and fundamental analysis [39]. Technical analysis
looks only into the past data of the market to predict the future. Fundamental
analysis, on the other hand, takes into account other information such as the
economic status, news, financial reports, meeting notes of discussions between
CEOs, etc.

Technical analysis relies on the efficient market hypothesis (EMH) [25]. The EMH
states that all fluctuations in the market are reflected very quickly in the
price of stocks. In practice, the price can be updated on the order of
milliseconds [8], leading to very high volatility of the stocks. In recent years
technical analysis has attracted a lot of attention due to the simple fact that
we have enough information just by looking at the historical stock market data,
which is public and well-organized, whereas fundamental analysis requires
analyzing unstructured datasets.

While supervised learning techniques and, to a certain extent, unsupervised
learning algorithms are widely used in stock price prediction, to the best of our
knowledge reinforcement learning for stock price prediction has not yet received
the attention it deserves. The main issue with supervised learning algorithms is
that they are not adequate for dealing with time-delayed rewards [18,22]. In
other words, supervised learning algorithms focus only on the accuracy of the
prediction at the current moment without considering the delayed penalty or
reward. Furthermore, most supervised machine learning algorithms can only provide
action recommendations on particular stocks¹, whereas reinforcement learning
leads us directly to the decision-making step, i.e. deciding whether to buy, hold
or sell a stock.

In the present paper we study the usage of reinforcement learning in stock
trading. We review related work in Sect. 2 and state our contribution in Sect. 3.
We present our approach in Sect. 4. We describe and discuss the experimental
results in Sect. 5. We conclude the paper and outline some future research
directions in Sect. 6.


2 Literature Review 


There are two main applications of using machine learning in the stock markets: 
stock price prediction and stock trading. 

Stock price prediction can be divided into two applications: price regression
and stock trend prediction. In the first, researchers aim to predict the exact
numerical price, usually based on the day-wise price [15] or the closing price of
a stock. In the second, researchers usually aim to predict the turning point of a
stock price, i.e. when the stock price changes its direction from up to down or
vice versa [44]. Traditionally, time-series forecasting techniques such as ARIMA
and its variants [26,43] are adapted from the econometric literature. However,
these methods cannot cope with the non-stationary and non-linear nature of the
stock market [2].

It is claimed that the stock price reflects the beliefs or opinions of the market
about the stock rather than the value of the stock itself [7]. Several research
studies propose analyzing social opinions to predict the stock price. In [33],
the authors used Google Trends, i.e. they analyzed the Google query volumes for a
particular keyword or search term, to measure the level of public attention to a
stock. The research is based on the idea that a decision-making process starts
with information collection [37].

Over centuries, researchers and practitioners have developed many technical
indicators to predict the stock price [29]. Technical indicators are defined as a
set of tools that allow us to predict the future stock market solely by looking
at historical market data [31]. Originally, technical analysis was not highly
supported in academia [27], even though it is very common in practice [35].
Nevertheless, with the development of the machine learning community, technical
analysis has gained the attention of researchers in recent years. The authors of
[32] derived what they called "Trend Deterministic Data Preparation" from ten
technical indicators and then combined it with several machine learning
techniques such as Support Vector Machine (SVM) or Random Forest for stock price
movement prediction. The "Trend Deterministic Data Preparation" is simply the
indication from the technical indicators that the price will go up or down, so
the approach of [32] can be considered as ensemble learning from local experts
[17].


1 According to the NASDAQ standard, a recommendation from analysts can be Strong
Buy, Buy, Hold, Underperform or Sell. Reference:
https://www.nasdaq.com/quotes/analyst-recommendations.aspx. Accessed on
07-September-2019.

The authors of [44] built on the idea of the Japanese candlestick in stock
analysis [30] to develop a status box method combined with a probabilistic SVM to
predict stock movement. We visualize a Japanese candlestick in Fig. 1. The status
box developed by [44] is presented in Fig. 2. The main idea is that, instead of
focusing on only one time period as the traditional Japanese candlestick method
does, the status box focuses on a wider range of time, which allows us to
overcome small fluctuations in the price.


[Figure: two candlesticks with the labels HIGH, UPPER SHADOW (WICK), OPEN, CLOSE,
BODY, LOWER SHADOW (WICK) and LOW]


Fig. 1. Japanese candlestick used in stock analysis. A candlestick shows the
highest and lowest price of a stock in a period of time, as well as the opening
and closing prices of this stock.
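
To make the parts of a candlestick concrete, the sketch below summarizes the
prices observed within one period into the quantities named in Fig. 1. The
function name and the example prices are hypothetical.

```python
def candlestick(prices):
    """Summarize the chronologically ordered prices of one period as a candlestick."""
    open_, close = prices[0], prices[-1]
    high, low = max(prices), min(prices)
    body_top, body_bottom = max(open_, close), min(open_, close)
    if close > open_:
        direction = "up"
    elif close < open_:
        direction = "down"
    else:
        direction = "flat"
    return {
        "open": open_,
        "high": high,
        "low": low,
        "close": close,
        "body": body_top - body_bottom,     # distance between open and close
        "upper_shadow": high - body_top,    # wick above the body
        "lower_shadow": body_bottom - low,  # wick below the body
        "direction": direction,
    }


# Hypothetical prices within one trading day.
candle = candlestick([100.0, 103.0, 99.0, 102.0])
print(candle)
```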


It is quite straightforward to combine technical analysis and fundamental
analysis into a single predictive model. The authors of [42] combined news
analysis with technical analysis for stock movement prediction and claimed that
the combination yields a better predictive performance than any single source.

Deep learning techniques, both supervised and unsupervised, have been utilized
for stock market prediction. One of the very first research works in this area is
[40], published in 1996, which used recurrent neural networks (RNN) on
ARIMA-based features. Many other feature extraction methods based on supervised
or unsupervised learning have been developed since then [21].

Fig. 2. A status box based on [44].


The authors of [4] used three different unsupervised feature extraction methods,
namely principal component analysis (PCA), auto-encoder and restricted Boltzmann
machine (RBM), for the auto-regressive (AR) model. In the same direction, the
authors of [24] designed a multi-filter neural network to automatically extract
features from stock price time-series. The authors combined both convolutional
and recurrent filters into one network for the feature extraction task. In [45]
the authors used Empirical Mode Decomposition [34] with neural networks for
feature extraction.


3 Our Contribution 


Compared to supervised/unsupervised learning approaches, our approach differs in
that we generate a trading strategy rather than only a stock price prediction as
in existing research studies [10,19]. Stock price prediction is certainly a very
important task, but eventually we need to build a strategy to decide what to buy
and what to sell in the market, which requires a further research step. In this
study, we employ a simple greedy strategy based on the prediction of a supervised
learning algorithm (RNN-LSTM) as the baseline, but studying a strategy based on
the prediction of another algorithm is by no means a trivial task.

Several research works using reinforcement learning have been presented in the
literature [1,6,22]. However, the work of [22] was presented in 2001 and uses
only the TD(0) algorithm, while the works of [6] and [1] used external
information such as news for the trading task. In our approach we do not use any
external information, only the historical stock price data.

Fig. 3. The interaction between agent and environment in reinforcement learning.


4 Reinforcement Learning 


Reinforcement learning [38] is visualized in Fig. 3. Different from supervised
learning techniques that can learn the entire dataset in one scan, the
reinforcement learning agent learns by interacting repeatedly with the
environment. We can imagine the agent as a stock trader and the environment as
the stock market [22]. At a time step $t$, the agent performs an action $A_t$ and
receives a reward $R_{t+1} = R(S_t, A_t)$. The environment then moves to the new
state $S_{t+1} = \delta(S_t, A_t)$. The agent needs to learn a policy
$\pi: S \to A$, i.e. learn to react to the environment so as to maximize the
total reward:


$$V^{\pi}(S_t) = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \qquad (1)$$


Here, the coefficient $\gamma$ represents the decay (discount) factor, usually
interpreted as the interest rate in finance, which reflects the statement that
"one dollar today is better than one dollar tomorrow". It means that any trading
strategy should beat the risk-free interest rate, because otherwise a reasonable
investor should not invest in this strategy at all: she should invest her money
at the risk-free rate, for example by buying T-bonds or opening a savings
account. The latter option will give her a better profit and a lower risk.
However, in high-frequency trading and over short periods of time we can set
$\gamma$ close to 1. The optimal policy is denoted as $\pi^*$.
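
As a small numerical illustration of the discounted sum in Eq. (1), the sketch
below evaluates it for a finite, hypothetical list of future rewards. Truncating
the sum to a finite horizon is an assumption made only for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite list of future rewards."""
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical profits of three consecutive steps
print(discounted_return(rewards, gamma=0.5))  # later rewards count less
print(discounted_return(rewards, gamma=1.0))  # no discounting
```

With a decay factor of 0.5 the three unit rewards are worth 1 + 0.5 + 0.25, while
with a factor of 1 they are simply added up.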

In this paper we employ Deep Q-learning [28], approximating the optimal
action-value function (Q-function), from which the policy is derived, by a deep
neural network. The term "Deep" here refers to Deep Convolutional Neural Networks
(CNNs) [36]. We parameterize the Q-function by a parameter set $\theta$.

In our setting, the actions are similar to those in other stock trading studies:
buy, hold or sell. We define the reward as the profit (positive, neutral or
negative) after each action.

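The loss function is:


$$L(\theta) = \frac{1}{N} \sum_{i \in N} \left( Q_\theta(S_i, A_i) - Q'_\theta(S_i, A_i) \right)^2 \qquad (2)$$

with

$$Q'_\theta = R(S_t, A_t) + \gamma \max_{A'_i} Q_\theta(S'_i, A'_i) \qquad (3)$$

The network is updated by a normal gradient descent:


$$\theta \leftarrow \theta - \alpha \frac{\partial L}{\partial \theta} \qquad (4)$$


In deep Q-network, the gradient of the loss function is calculated as:


$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{S,A \sim P(\cdot),\, S' \sim \mathcal{E}} \left[ \left( R_{t+1} + \gamma \max_{A'} Q(S_{t+1}, A') - Q(S, A, \theta_i) \right) \nabla_{\theta_i} Q(S, A, \theta_i) \right] \qquad (5)$$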
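
The update in Eqs. (2)-(4) can be illustrated with a table of Q-values standing
in for the neural network. This is a simplified sketch under stated assumptions:
the Q-table, the state names and the single transition are hypothetical, the loss
is taken as a mean over the batch, and the target is held fixed during the
gradient step, as is usual in Q-learning.

```python
ACTIONS = ("buy", "hold", "sell")


def td_targets(q, batch, gamma):
    """Targets r + gamma * max_a' Q(s', a') for (s, a, r, s_next) transitions."""
    return [r + gamma * max(q[(s_next, a2)] for a2 in ACTIONS)
            for (_, _, r, s_next) in batch]


def mse_loss(q, batch, targets):
    """Mean squared difference between Q(s, a) and its target."""
    return sum((q[(s, a)] - y) ** 2
               for (s, a, _, _), y in zip(batch, targets)) / len(batch)


def gradient_step(q, batch, targets, alpha):
    """One gradient-descent step on the loss, with the targets treated as constants."""
    n = len(batch)
    grads = {}
    for (s, a, _, _), y in zip(batch, targets):
        grads[(s, a)] = grads.get((s, a), 0.0) + 2.0 * (q[(s, a)] - y) / n
    for key, grad in grads.items():
        q[key] -= alpha * grad


# Hypothetical two-state problem with all Q-values initialized to zero.
q = {(s, a): 0.0 for s in ("s0", "s1") for a in ACTIONS}
batch = [("s0", "buy", 1.0, "s1")]  # buying in s0 earned a profit of 1
targets = td_targets(q, batch, gamma=0.9)
loss_before = mse_loss(q, batch, targets)
gradient_step(q, batch, targets, alpha=0.25)
loss_after = mse_loss(q, batch, targets)
print(loss_before, q[("s0", "buy")], loss_after)
```

In a deep Q-network the table lookup is replaced by a forward pass of the
network, and the per-entry gradient by backpropagation through its parameters.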


5 Experiments 


5.1 Datasets 


We use the daily stock prices of more than 7,000 US-based stocks collected up to
10-November-2017². For each stock, we always use the period from 01-January-2017
until 10-November-2017 for testing, and the data from 01-January-2015 until
31-December-2016 as the training set. Hence, there are 504 samples for training
and 218 samples for testing. The sample size is very small compared to well-known
supervised learning problems such as ImageNet [20], which contains one million
labelled images, but as we will present in Sect. 5.2, we can still generate
strategies with a positive profit.
The stock price of Google is displayed in Fig. 4.

Fig. 4. The stock price of Google from 01-Jan-2015 to 10-November-2017
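
The chronological split described in Sect. 5.1 can be sketched as below. The
date boundaries are those stated in the text; the row format `(day, close_price)`
and the example prices are hypothetical, since the layout of the data files is
not described here.

```python
from datetime import date

TRAIN_START, TRAIN_END = date(2015, 1, 1), date(2016, 12, 31)
TEST_START, TEST_END = date(2017, 1, 1), date(2017, 11, 10)


def split_by_date(rows):
    """Split (day, close_price) rows into chronological training and test sets."""
    train = [row for row in rows if TRAIN_START <= row[0] <= TRAIN_END]
    test = [row for row in rows if TEST_START <= row[0] <= TEST_END]
    return train, test


# Hypothetical rows; real data would hold one row per trading day.
rows = [
    (date(2014, 12, 31), 50.0),  # before the training period, dropped
    (date(2015, 1, 2), 51.0),
    (date(2016, 12, 30), 60.0),
    (date(2017, 1, 3), 61.0),
    (date(2017, 11, 10), 70.0),
    (date(2017, 11, 13), 71.0),  # after the test period, dropped
]
train, test = split_by_date(rows)
print(len(train), len(test))
```

Splitting by date rather than at random keeps all test days strictly after the
training days, so no future price leaks into training.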


5.2 Experimental Results 


We evaluate three variants of Deep Q-learning: vanilla Deep Q-learning [28],
double DQN [12] and dueling double DQN [41].
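
The three variants differ in how the learning target and the Q-values are formed.
The sketch below summarizes the standard definitions from the cited works with
plain dictionaries in place of networks; the function names and all numbers are
hypothetical and the code is not taken from the experiments.

```python
def vanilla_target(reward, gamma, q_next):
    """Vanilla DQN: one estimate both selects and evaluates the next action."""
    return reward + gamma * max(q_next.values())


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double DQN: the online estimate selects, the target estimate evaluates."""
    best_action = max(q_online_next, key=q_online_next.get)
    return reward + gamma * q_target_next[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages.values()) / len(advantages)
    return {a: state_value + adv - mean_adv for a, adv in advantages.items()}


# Hypothetical next-state estimates from two networks.
q_online_next = {"buy": 1.0, "hold": 0.5, "sell": 0.0}
q_target_next = {"buy": 0.25, "hold": 2.0, "sell": 0.0}

y_vanilla = vanilla_target(1.0, 0.5, q_target_next)
y_double = double_dqn_target(1.0, 0.5, q_online_next, q_target_next)
q_dueling = dueling_q_values(1.0, {"buy": 1.0, "hold": 0.0, "sell": -1.0})
print(y_vanilla, y_double, q_dueling)
```

The double DQN target is smaller here because the action preferred by the online
estimate is not the one with the largest target estimate, which is how the
variant reduces over-estimation.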

We used an RNN-LSTM model with a greedy strategy as the baseline model. According
to a recent time-series prediction competition³ organized in July 2019, the
RNN-LSTM model achieved the highest performance in forecasting a financial
time-series. The greedy strategy means that we buy every time we predict the
stock will go up and sell if we predict the stock will go down.
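
A minimal sketch of such a greedy rule is given below. The forecasts stand in for
the output of the RNN-LSTM model and are hypothetical, as are the trade size of
one share per buy signal, the ban on short selling and the valuation of leftover
shares at the last price; the text does not specify these details.

```python
def greedy_trade(prices, predicted_next):
    """Profit of a greedy rule driven by next-day price forecasts.

    Buys one share when the forecast is above today's price and sells all
    held shares when it is below (both are illustrative assumptions).
    """
    cash, shares = 0.0, 0
    for price, forecast in zip(prices, predicted_next):
        if forecast > price:
            cash -= price            # buy one share at today's price
            shares += 1
        elif forecast < price and shares > 0:
            cash += shares * price   # sell everything at today's price
            shares = 0
    # Value any remaining shares at the last observed price.
    return cash + shares * prices[-1]


prices = [10.0, 11.0, 12.0, 11.0]          # hypothetical daily prices
predicted_next = [11.0, 12.0, 11.0, 11.0]  # hypothetical next-day forecasts
profit = greedy_trade(prices, predicted_next)
print(profit)
```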

We visualize the performance of vanilla DQN, double DQN, dueling DQN and the
baseline LSTM model on the Google stock (code: GOOGL) in Figs. 5, 6 and 7. The
profit of each model is described in the corresponding figure. In each figure, we
visualize the profit of each model in general, the profit of each model against
the stock price and the profit of each model against the standard deviation of
the stock price.


2 https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs.
3 https://www.isods.org/.

Fig. 5. Performance of vanilla DQN on Google stock. The profit on the test period
is -838.

Fig. 6. Performance of Double DQN on Google stock. The profit on the test period
is 1430.

Fig. 7. Performance of Dueling Double DQN on Google stock. The profit on the test
period is 141.

Generally speaking, the Deep Q-Network yields the highest average profit compared
to the Double Deep Q-Network and the Dueling Deep Q-Network. The result is
consistent with other studies [14]. However, as expected, the Deep Q-Network
yields a higher volatility than the two other methods. We display the
distribution of profit generated by each model in Fig. 8. It is clear that the
Double Q-Network seeks profit, and hence it sometimes generates a negative
profit. We note that the LSTM model combined with the greedy algorithm is much
more stable than the other models because we buy and sell immediately when we
detect any signal of a price change.

As described above, we visualize the profit against the mean and the standard
deviation of the stock price in the testing period in Figs. 9 and 10. The main
idea is to see whether a model behaves differently in different segments of stock
price. The general trend for all models is that the expected profit range is
higher given a higher stock price and a higher stock volatility. The results are
consistent with the principle of no arbitrage in finance [9], which basically
states that we cannot expect a higher profit without facing a higher risk.

Fig. 8. Distribution of profit of three models

Fig. 9. The profit against mean of stock price in the testing period

Fig. 10. The profit against standard deviation of stock price in the testing period


6 Conclusions 


In this paper we study the usage of Deep Q-Network for stock trading. We
evaluated the performance of Deep Q-Network on large-scale real-world datasets.
Deep Q-Network allows us to trade stocks directly without a further optimization
step, unlike supervised learning methods. Using only a few hundred samples,
reinforcement learning algorithm variants based on Q-learning can generate
strategies that on average earn a positive profit.

In the future, we plan to incorporate multiple stock trading, i.e. portfolio
management strategies, into the study. Furthermore, we will introduce different
constraints into the model, for instance the maximum loss one can tolerate while
using a model. Another approach is to integrate the simulated behavior of users
in non-cooperative or cooperative markets [5,16].


Acknowledgment. We would like to thank the anonymous reviewer for valuable 
comments. 




References 


1. Azhikodan, A.R., Bhat, A.G., Jadhav, M.V.: Stock trading bot using deep
reinforcement learning. In: Innovations in Computer Science and Engineering,
pp. 41-49. Springer, Singapore (2019)
2. Bisoi, R., Dash, P.K.: A hybrid evolutionary dynamic neural network for stock
market trend analysis and prediction using unscented Kalman filter. Appl. Soft
Comput. 19, 41-56 (2014)
3. Bradley, D.A.: Stock Market Prediction: The Planetary Barometer and how to Use
it. Llewellyn Publications, Woodbury (1948)
4. Chong, E., Han, C., Park, F.C.: Deep learning networks for stock market
analysis and prediction: methodology, data representations, and case studies.
Expert Syst. Appl. 83, 187-205 (2017)
5. Dang, Q., Ignat, C.: Computational trust model for repeated trust games. In:
Trustcom/BigDataSE/ISPA, pp. 34-41. IEEE (2016)
6. Deng, Y., Bao, F., Kong, Y., Ren, Z., Dai, Q.: Deep direct reinforcement
learning for financial signal representation and trading. IEEE Trans. Neural
Netw. Learn. Syst. 28(3), 653-664 (2017)
7. Elton, E.J., Gruber, M.J., Brown, S.J., Goetzmann, W.N.: Modern Portfolio
Theory and Investment Analysis, 9th edn. Wiley, Hoboken (2014)
8. Florescu, I., Mariani, M.C., Stanley, H.E., Viens, F.G.: Handbook of
High-frequency Trading and Modeling in Finance, vol. 9. Wiley, Hoboken (2016)
9. Follmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time,
4th edn. Walter de Gruyter, Berlin (2016)
10. Gocken, M., Ozcalici, M., Boru, A., Dosdogru, A.T.: Stock price prediction
using hybrid soft computing models incorporating parameter tuning and input
variable selection. Neural Comput. Appl. 31(2), 577-592 (2019)
11. Granger, C.W.J., Morgenstern, O.: Predictability of Stock Market Prices. Heath
Lexington Books, Lexington (1970)
12. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double
q-learning. In: AAAI, pp. 2094-2100. AAAI Press (2016)
13. Henrique, B.M., Sobreiro, V.A., Kimura, H.: Literature review: machine
learning techniques applied to financial market prediction. Expert Syst. Appl.
124, 226-251 (2019)
14. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B.,
Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J.,
Leibo, J.Z., Gruslys, A.: Deep q-learning from demonstrations. In: AAAI,
pp. 3223-3230. AAAI Press (2018)
15. Hiransha, M., Gopalakrishnan, E.A., Menon, V.K., Soman, K.: NSE stock market
prediction using deep-learning models. Procedia Comput. Sci. 132, 1351-1362
(2018)
16. Ignat, C.L., Dang, Q.V., Shalin, V.L.: The influence of trust score on
cooperative behavior. ACM Trans. Internet Technol. (TOIT) 19(4), 46 (2019)
17. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E., et al.: Adaptive
mixtures of local experts. Neural Comput. 3(1), 79-87 (1991)
18. Jangmin, O., Lee, J., Lee, J.W., Zhang, B.T.: Adaptive stock trading with
dynamic asset allocation using reinforcement learning. Inf. Sci. 176(15),
2121-2147 (2006)
19. Jiang, X., Pan, S., Jiang, J., Long, G.: Cross-domain deep learning approach
for multiple financial market prediction. In: IJCNN, pp. 1-8. IEEE (2018)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep
convolutional neural networks. In: NIPS, pp. 1106-1114 (2012)
21. Langkvist, M., Karlsson, L., Loutfi, A.: A review of unsupervised feature
learning and deep learning for time-series modeling. Pattern Recogn. Lett. 42,
11-24 (2014)
22. Lee, J.W.: Stock price prediction using reinforcement learning. In: ISIE 2001.
2001 IEEE International Symposium on Industrial Electronics Proceedings (Cat. No.
01TH8570), vol. 1, pp. 690-695. IEEE (2001)
23. Lo, A.W., Mamaysky, H., Wang, J.: Foundations of technical analysis:
computational algorithms, statistical inference, and empirical implementation.
J. Financ. 55(4), 1705-1765 (2000)
24. Long, W., Lu, Z., Cui, L.: Deep learning-based feature engineering for stock
price movement prediction. Knowl. Based Syst. 164, 163-173 (2019)
25. Malkiel, B.G., Fama, E.F.: Efficient capital markets: a review of theory and
empirical work. J. Financ. 25(2), 383-417 (1970)
26. Menon, V.K., Vasireddy, N.C., Jami, S.A., Pedamallu, V.T.N., Sureshkumar, V.,
Soman, K.: Bulk price forecasting using spark over NSE data set. In:
International Conference on Data Mining and Big Data, pp. 137-146. Springer, Cham
(2016)
27. Mitra, S.K.: How rewarding is technical analysis in the indian stock market?
Quant. Financ. 11(2), 287-297 (2011)
28. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra,
D., Riedmiller, M.A.: Playing Atari with deep reinforcement learning. CoRR
abs/1312.5602 (2013)
29. Nazario, R.T.F., e Silva, J.L., Sobreiro, V.A., Kimura, H.: A literature
review of technical analysis on stock markets. Q. Rev. Econ. Financ. 66, 115-126
(2017)
30. Nison, S.: Japanese Candlestick Charting Techniques: A Contemporary Guide to
the Ancient Investment Techniques of the Far East. Penguin, New York (2001)
31. Park, C.H., Irwin, S.H.: What do we know about the profitability of technical
analysis? J. Econ. Surv. 21(4), 786-826 (2007)
32. Patel, J., Shah, S., Thakkar, P., Kotecha, K.: Predicting stock and stock
price index movement using trend deterministic data preparation and machine
learning techniques. Expert Syst. Appl. 42(1), 259-268 (2015)
33. Preis, T., Moat, H.S., Stanley, H.E.: Quantifying trading behavior in
financial markets using Google trends. Sci. Rep. 3, 1684 (2013)
34. Rilling, G., Flandrin, P., Goncalves, P., et al.: On empirical mode
decomposition and its algorithms. In: IEEE-EURASIP Workshop on Nonlinear Signal
and Image Processing, vol. 3, pp. 8-11. NSIP 2003, Grado (1) (2003)
35. Schulmeister, S.: Profitability of technical stock trading: has it moved from
daily to intraday data? Rev. Financ. Econ. 18(4), 190-201 (2009)
36. Sewak, M.: Deep Reinforcement Learning - Frontiers of Artificial Intelligence.
Springer, Singapore (2019)
37. Simon, H.A.: A behavioral model of rational choice. Q. J. Econ. 69(1), 99-118
(1955)
38. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (2018)
39. Thomsett, M.C.: Getting Started in Fundamental Analysis. Wiley, Hoboken (2006)
40. Wang, J., Leu, J.: Stock market trend prediction using ARIMA-based neural
networks. In: ICNN, pp. 2160-2165. IEEE (1996)
41. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., de Freitas,
N.: Dueling network architectures for deep reinforcement learning. In: ICML. JMLR
Workshop and Conference Proceedings, vol. 48, pp. 1995-2003. JMLR.org (2016)
42. Zhai, Y.Z., Hsu, A.L., Halgamuge, S.K.: Combining news and technical
indicators in daily stock price trends prediction. In: ISNN (3). Lecture Notes in
Computer Science, vol. 4493, pp. 1087-1096. Springer, Heidelberg (2007)
43. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network
model. Neurocomputing 50, 159-175 (2003)
44. Zhang, X., Li, A., Pan, R.: Stock trend prediction based on a new status box
method and adaboost probabilistic support vector machine. Appl. Soft Comput. 49,
385-398 (2016)
45. Zhou, F., Zhou, H., Yang, Z., Yang, L.: EMD2FNN: a strategy combining
empirical mode decomposition and factorization machine based neural network for
stock market trend prediction. Expert Syst. Appl. 115, 136-151 (2019)




A Survey on Forecasting Models for Preventing 
Terrorism 


Botambu Collins¹, Dinh Tuyen Hoang¹,³, HyoJeon Yoon¹, Ngoc Thanh Nguyen²,
and Dosam Hwang¹(✉)
1 Department of Computer Engineering, Yeungnam University, Gyeongsan, South Korea
botambucollins@gmail.com, hoangdinhtuyen@gmail.com,
hjyoon314@ynu.ac.kr, dosamhwang@gmail.com
2 Faculty of Computer Science and Management,
Wroclaw University of Science and Technology, Wroclaw, Poland
Ngoc-Thanh.Nguyen@pwr.edu.pl
3 Faculty of Engineering and Information Technology, Quang Binh University, Dong Hoi,
Vietnam


Abstract. The security and welfare of a country are crucial, and states invest
heavily to protect their territorial integrity from external aggression.
However, the rise in acts of terrorism has given birth to a new form of
security challenge. Terrorism has caused untold suffering and damage to
civilian lives and property; hence, finding a lasting solution to terrorism has
become imperative. Until recently, the nature of terrorism and the means of
combating it were debated largely by politicians and statesmen. This study
surveys machine learning models that aim to prevent terrorism through
forecasting. Terrorism is dynamic in nature, i.e. terrorists' strategies change
as counterterrorism methods improve, and therefore a more sophisticated way of
predicting their moves is required. The models discussed in this study include
the Hawkes process, STONE, SNA, TGPM, and the Dynamic Bayesian Network (DBN),
all of which are geared towards predicting the likelihood of a terrorist attack.


Keywords: Terrorism - Counterterrorism - Forecasting models - 
Counterterrorism models 


1 Introduction and Background 


Terrorism is a crucial issue affecting the contemporary world. The term derives
from the word “terror” and dates back to the French Revolution of the 1790s, when it
referred to the French government's use of force to intimidate and coerce the
populace into submission [4, 10]. Until recently, the debate about terrorism and the
means of combating it has been viewed as a political science discourse, and the effort
towards combating terrorism has thus been in the hands of politicians. Yet the
weapons used to carry out acts of terrorism are manufactured by engineers, and hence
there is a need for an engineering model for handling terrorism. Before we proceed,
it is crucial to understand the meaning of terrorism.

The discourse on the nature and meaning of terrorism has been a conflicting one,
since there is no agreement on the definition of terrorism [18]. The FBI defines
terrorism as “the unlawful use of force and violence against persons or property
to intimidate or coerce a government, the civilian population, or any segment thereof,
in furtherance of political or social objectives”. A UN Security Council resolution [18]
offers a widely used definition of terrorism, which states that any “criminal acts, including
against civilians, committed with the intent to cause death or serious bodily injury, or
taking of hostages, with the purpose to provoke a state of terror in the general public
or in a group of persons or particular persons, intimidate a population or compel a
government or an international organization to do or to abstain from doing any act”.
Tavares [16] summarizes the short-term goals of terrorism as (1) gaining publicity and
media attention [4, 16], (2) destabilizing the political community, and (3) wreaking
havoc on economies.

The definition of terrorism attracts divergent views, and this has been the major
problem in countering terrorism, since one state may designate a group as terrorist
while another state holds a contrary view: “one man's terrorist is another man's
freedom fighter” [4, 10, 18]. Although Britain and its allies designated the Irish
Republican Army (IRA) as a terrorist group, its members saw themselves as freedom
fighters fighting for a just cause, and they had sympathizers and support. The question
of who counts as a terrorist therefore depends on the state's interest [10].


2 Rationale for Terrorism 


Scholars of peace and conflict studies, political science and sociology have advanced
many arguments as to why people engage in acts of terrorism [1, 4, 16]. Understanding
why people join terrorist groups or become terrorists is a necessary step toward
combating terrorism. The following are some of the causes of terrorism.


2.1 Economic and Social Deprivation 


Some scholars have argued that poverty and economic deprivation are among the
fundamental causes of terrorism and its related activities [12]. When people are poor,
they may engage in unscrupulous acts just to make ends meet. ISIS, which is now
believed to be one of the largest terrorist organizations on Earth, recruited Europeans
and other members online by promising high wages as well as other benefits, such as
offering fighters free women. As a result, its numbers multiplied within a short
period. Reference [12] contends that persistent poverty and oppression can lead to
hopelessness and despair, thus giving birth to terror, as it becomes easy for a terrorist
organization to recruit such poverty-stricken individuals. On the other hand, not all
terrorist groups emerge as a result of poverty, and not every poverty-stricken
individual joins a terrorist group or is motivated to become a terrorist. Poverty alone
cannot be the basis for someone to commit an act of terror; for instance, Osama Bin
Laden, one of the leaders of the Al-Qaeda terrorist network, came from a wealthy
family.


2.2 State Weakness (Failed State) (Syria, Iraq, Libya, CAR, Somalia)


Failed states are states that are unable to deliver political, social and economic
goods to their citizens as a result of political upheavals [13, 19]. Such a state is
characterized by insecurity and chaos. When a state fails or collapses, crime surges
and the proliferation of weapons becomes imminent. The Fund for Peace states that a
“state is failing when its government is losing physical control of its territory or lacks
a monopoly on the legitimate use of force”. State weakness has often been a cause of
terrorism, and most failed states become safe havens for terrorist organizations
[17]. For instance, ISIS took advantage of Libya as a failed state and expanded its
network in North Africa; the same is true in Syria and Iraq, where ISIS or “Daesh”
gained control of substantial land and resources. It is therefore important to take
state weakness into account when looking for suitable means of combating terrorism.


2.3 External Causes Based on Invasion (Iraq, Afghanistan, Libya)


Although the USA and its allies have often played a pivotal role in fighting terrorism,
they also bear a large share of responsibility for causing terrorism, by promoting
radicalization as well as by causing state failure. State failure has been used as a
pretext for imperialism and invasion by stronger states [17]. Invasions by the US and
its allies in the name of “getting rid of a dictator” have often caused more harm than
good. The US and British invasions of Iraq and Afghanistan are examples: the invasion
of Iraq was carried out in the name of getting rid of Saddam Hussein and his so-called
Weapons of Mass Destruction (WMD), yet no such weapons were found in Iraq. The
main motive behind the invasions of both Iraq and Afghanistan stemmed from anger
over the September 11 attacks, and thus the invasions had massive support. The White
House posited that “The events of 9/11, taught us that weak states, like Afghanistan,
can pose a great danger to our national interests as a strong state” [1]. Iraq, a stable
state before the invasion, later became a failed state, with supporters of Hussein
joining rebel factions. Amid abject poverty and misery, Iraq became a breeding ground
for terrorist groups, as did Afghanistan. War has far-reaching consequences, such as
destruction, poverty and sometimes famine, which persist after the war has ended. As
discussed above, poor people become a source of recruits for terrorist organizations,
and this was true in Iraq and Afghanistan. Libya, at the time a wealthy state, suffered
an invasion by NATO and became a failed state; today it is a safe haven for ISIS and
other terrorist groups.


2.4 Religious Fundamentalism and Extremism 


Religious fundamentalism is viewed both as a type of terrorism in itself and as a
cause of terrorism. Most of the renowned terrorist organizations today are linked to
religious movements, especially Islam. This is not to say that Islam is a religion of
terror, but groups such as Al Shabaab, Al Nusra Front, AQIM, Boko Haram and Al Qaeda
usually glorify Allah (God) in their actions. ISIS's main goal is to install a global
caliphate and make Islam the sole religion on Earth, while Boko Haram's main motive
is to eradicate Western values, make Nigeria an Islamic republic, and install a
caliphate across the whole of Africa. Almost all Islamic-linked terrorist groups use
the slogan “Allahu Akbar”, meaning God is the greatest. Religious fanatics and
extremists often believe that there are certain rules which believers must obey, and
they often see non-believers as a threat or as enemies who need to be eradicated. In
most cases, the perpetrators of these acts are made to believe that their actions are
ordained by God and that there is no God other than the one of their own belief
system. There is constant propaganda that a person who commits a suicide attack is
guaranteed a good life in heaven, where he will be welcomed by 72 virgins, and this
makes suicide bombers even more willing to carry out such acts.


3 Types of Terrorism 


There exists a plethora of terrorism types, such as state terrorism, nuclear terrorism,
bioterrorism, narcoterrorism and ecoterrorism, each aimed at inflicting maximum damage
on its victims. Counterterrorism strategies require an understanding of the type and
nature of terrorism before a model for combating it can be derived. We discuss two
relevant types of terrorism, namely nuclear terrorism and cyberterrorism.


3.1 Nuclear Terrorism 


The aftermath of the atomic bombs that devastated the cities of Hiroshima and Nagasaki
is still fresh in our minds, and with the devastating effects still present to date, one
wonders what would happen if a terrorist could lay hands on such a dangerous weapon.
The non-proliferation treaties geared toward preventing states from acquiring nuclear
weapons were successful, in part, because they addressed the motives of non-nuclear
states aspiring to acquire nuclear weapons, such as Iran. At present the problem is more
complex, because the effort is now directed towards non-state actors, which in the case
of terrorist organizations are usually invisible. Nuclear terrorism refers to the diverse
means by which terrorists can utilize nuclear material to achieve their objectives.
These include the bombing and destruction of nuclear facilities to wreak havoc and
cause economic damage to the community, and the purchase, production and utilization
of nuclear material to coerce, maim and instill fear in a civilian population. The actions
of Aum Shinrikyo in Japan, a religious cult and terrorist group that attacked the Tokyo
subway with sarin gas, show that terrorism has no limit [5]. So far, two terrorist groups
have attempted to acquire nuclear weapons, al-Qaeda and Aum Shinrikyo [5, 8]. Both
groups tried multiple times, unsuccessfully, to buy nuclear weapons from Russian
sources. Aum Shinrikyo even recruited scientists with a plan to develop a nuclear
weapon and bought a ranch in Australia to enrich uranium [5].


3.2 Cyber Terrorism 


The rapid advancement of information and communication systems has paved the way for
a new wave of technological terrorism, also referred to as cyberterrorism. The
psychological fear of cyberterrorism exceeds its actual impact: the media, politicians,
and military and security experts make cyberterrorism sound more frightening than
conventional terrorism [21]. Imagine if someone could hack the nuclear defense system
and push the red button. What about the traffic system, the banks, and the dams? These
are the frightening stories about cyberterrorism that the media and politicians present
to the public. Weimann [21] contends that politicians and military personnel benefit
financially when such fear is conveyed to the masses, because it promotes massive
funding to protect cyberspace and consequently an increase in the defense budget, even
without any real threat [21]. The meaning of cyberterrorism remains contested, owing
to the fact that no real case of cyberterrorism has yet happened [15, 21]. Denning
[6] offers a broadly accepted definition of cyberterrorism as
“the convergence of cyberspace and terrorism where unlawful attacks and threats of
attack against computers, networks, and the information stored therein are carried out
to intimidate or coerce a government or its people in furtherance of political or social
objectives, and should result in violence against persons or property, or at least cause
enough harm to generate fear. Attacks that lead to death or bodily injury, explosions,
plane crashes, water contamination, or severe economic loss would be examples. Serious
attacks against critical infrastructures could be acts of cyberterrorism, depending on
their impact.” A terrorist can use the computer as a weapon, for example by stealing
the identities of users, spreading computer viruses and malware, hacking, or destroying
or manipulating important data [7].


4 Means of Combating Terrorism 


Counterterrorism is the set of means employed by states or security experts to prevent
terrorism. Usually, the best way to counter terrorism is to predict the likely attack of
a terrorist, and this is where machine learning plays a vital role. In the past,
countering terrorism was the affair of politicians, who decided on the deployment of
military and security experts. Because the actions of a given terrorist are unknown in
advance, it is necessary to start by classifying the attack type, weapon type, target and
time based on previous history. The prevention of terrorism in the past has been based
on the cause and nature of terrorism. For instance, where a failing or failed state is
the cause of terrorism, as in Libya, Somalia and Iraq discussed above, such a state is
given assistance in terms of both security and financial aid in order to strengthen its
institutions. The spillover effect is another issue: where a particular country is in
conflict, the probability P that the conflict will move to a given neighbor T is
P = 1/n (n being the number of neighboring countries), and therefore each neighboring
country is required to strengthen its security and prevent the flow of combatants from
country X to country Y. All these strategies have worked to a certain degree, but
terrorism has dramatically increased, prompting the search for new approaches to
preventing terrorism. These new approaches include the use of machine learning models
to predict and prevent future terrorist attacks, and they are explained below.


4.1 Counterterrorism Model, Machine Learning Approach 


In this section, we review some of the approaches that have been used to predict or
forecast future terrorist attacks (Table 1). The use of computer technology to predict
the likelihood of a terrorist attack is a new trend in research.

Table 1. Models for forecasting terrorist attacks.

No. | Model | Proposed model | Uses/action | Authors
1 | SNA | STONE | Identifying and predicting the successor of a removed terrorist in a network (Al-Qaeda, Hamas, Hezbollah, Lashkar-e-Taiba) | Subrahmanian (2016)
1 | SNA | CT-SNAIR | Tracking and detecting terrorist scenarios and targets | Weinstein et al. (2009)
1 | SNA | SMD | High success rate in predicting whether someone is a terrorist | Coffman and Marcus (2014)
2 | DBN | TAPM | Detecting the rate of attack in transport areas | Manoj (2009)
3 | TGPM | - | Predicting the group responsible for an attack | Sachan and Roy (2012)
4 | MMURT | - | Predicting the method of radicalization and terrorism | Chuang and Maria (2016)
5 | Hawkes process | - | Predicting the likely attack of a terrorist group, e.g. the IRA | Hawkes (1971), Tench (2018)


The nature of terrorist attacks is such that one must always be ready, because
terrorists act in surprising ways when least expected; as such, predicting their actions
is a better way of countering terrorism.


4.2 Social Network Analysis Model 


The advancement of the internet has ushered in a new form of communication. Unlike
traditional media, social media is becoming increasingly popular, and terrorists use it
in several ways to achieve their objectives. The radicalization and recruitment of ISIS
members were mostly done via social media, and hence social network analysis has
become a useful tool for tracking and identifying terrorism [20]. Social Network
Analysis (SNA) is a method for appraising human social interaction, and it has been
used to investigate patterns and community structures [19]. The simulated datasets
modeled (SDM) approach championed by [19] used a two-class classification, terrorist
and non-terrorist, and achieved a 93% prediction rate. This model is useful for
identifying whether someone is a terrorist or not.

Weinstein and colleagues [22] used simulations of terrorist attacks based on real
information about past attacks and proposed a model known as Counter-Terror Social
Network Analysis and Intent Recognition (CT-SNAIR), which was used for tracking and
detecting terrorists. The intent recognition tool was vital in detecting potential
terrorist targets. They made use of real-world scenarios, referred to as forensic
scenarios, from the September 2004 bombing of the Australian embassy in Indonesia, as
well as fictional scenarios that had been developed in a prior project for research in
social network analysis. Using a new Terror Attack Description Language (TADL)
together with the Hidden Markov Model (HMM), they developed criteria for modeling and
simulating terrorist attacks.


Shaping Terrorist Organization Network Efficacy (STONE) 

Subrahmanian [20] used social network analysis to propose a model known as
Shaping Terrorist Organization Network Efficacy (STONE), which has three key
objectives:

• Quantifying Terror Network Lethality
• Predicting Successors of a Removed Terrorist
• Identifying Who to Remove


This method takes into account the network structure of a terrorist organization, in
which the vertices represent operational roles; it is assumed that every individual in
the network has a role to play irrespective of the hierarchical structure. The words
“removed” and “deleted” in the study refer to the targeted head of the network or the
most important person in it. A known terrorist who has been killed or neutralized, such
as Osama Bin Laden, is considered to have been “removed” from the Al-Qaeda
organizational network. The network here is the organizational network rather than a
social network. STONE is a framework for identifying a set of k nodes whose removal
will optimally destabilize a terrorist network. The STONE approach was tested on four
terrorist networks, namely Al-Qaeda, Hamas, Hezbollah and Lashkar-e-Taiba, and
achieved an 80% prediction success rate.


Predicting Successors of a Removed Terrorist 

Once a terrorist has been removed from a network, he will be replaced; predicting
his replacement and deleting that person from the network as well is a fundamental
step in destroying the network. In order to predict who will replace a removed
terrorist, the author made the four assumptions listed below.


Assumption I. It is assumed that the replacement vertex u is not too far away from v in
ON.

Assumption II. Assume that the probability that v is replaced by a vertex u depends on
u's rank in the network.

Assumption III. We assume that individuals with a higher weight than v do not desire
v's position.

Assumption IV. It is believed that ON will rebuild itself to be maximally lethal; thus,
when v is deleted from the network, u is selected in a way that maximizes lethality.


For instance, it had already been forecast that, should Osama Bin Laden be neutralized
or removed, he would be replaced by his right-hand man Ayman al-Zawahiri, and
hence removing him too would become inevitable. One of the most fundamental aspects of
this algorithm is the Person Successor Problem (PSP): which set of k (k > 0) terrorists
should be removed in order to minimize the lethality of the terrorist network.

Suppose ON = (V, E, wt, ℘) is an organizational network, r ∈ V is a vertex, and B
is a logical condition on VP. The person successor problem is the problem of finding a
vertex v ∈ V to replace r such that v meets the conditions that held for the “removed”
vertex.

The logical condition B, which restricts who can replace a removed terrorist r, is
simple to understand. For instance, if r is an IT engineer (i.e. ℘(r, IT engineer) = 1),
then replacing him with a cook (chef) would not yield fruit. The replacement node must
therefore have certain features of an IT engineer, and finding another IT engineer may
be difficult, which leads to network destabilization. Furthermore, the strength of this
model lies in the fact that it is not based on removing the organizational head, such as
the leader, but on identifying who plays a pivotal role in the organizational network,
such that removing this individual will cause the network to malfunction and make a
replacement difficult to find. In a terrorist network, bomb-makers, computer experts
and machine operators often play more important roles than those who merely give
orders; those giving commands can easily be replaced if removed, as in the replacement
of Osama Bin Laden.
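
As a rough illustration of how Assumptions I to III and the logical condition B narrow down the candidate successors, the following sketch filters and ranks vertices in a toy organizational network. The network, the weights, the roles, the distance bound `max_dist` and the ranking rule are all hypothetical, and Assumption IV (lethality maximization) is not modeled; this is a minimal sketch of the filtering idea, not the STONE algorithm of [20].

```python
from collections import deque

# Hypothetical toy organizational network (undirected adjacency lists).
EDGES = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
# Hypothetical vertex weights (rank) and operational roles.
WEIGHT = {"a": 5, "b": 4, "c": 2, "d": 3, "e": 1}
ROLE = {"a": "leader", "b": "IT engineer", "c": "cook",
        "d": "IT engineer", "e": "IT engineer"}


def distances_from(source, edges):
    """Breadth-first search: hop distance from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def candidate_successors(removed, required_role, max_dist=2):
    """Return possible successors of `removed`, most likely first."""
    dist = distances_from(removed, EDGES)
    candidates = [
        u for u in EDGES
        if u != removed
        and dist.get(u, float("inf")) <= max_dist   # Assumption I: close to v
        and WEIGHT[u] <= WEIGHT[removed]            # Assumption III: no higher weight
        and ROLE[u] == required_role                # logical condition B
    ]
    # Assumption II: higher-ranked vertices are more likely successors;
    # ties are broken by distance (an illustrative choice).
    return sorted(candidates, key=lambda u: (-WEIGHT[u], dist[u]))


if __name__ == "__main__":
    print(candidate_successors("b", "IT engineer"))
```

If the role required by B is rare in the network, the candidate list becomes short or empty, which corresponds to the destabilizing effect described above.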


4.3 Dynamic Bayesian Network 


According to Hudson and colleagues [11], the Bayesian network is a useful tool for
antiterrorism risk assessment. A Bayesian network was used in a profiler installation
security system planner, where it gave the user more accurate predictions for managing
attacks.

Manoj [14] developed a terrorist attack prediction model (TAPM) using dynamic
Bayesian networks (DBNs). The study was motivated by the September 11 attacks, in
which terrorists hijacked four US airplanes, and the method aims at predicting future
attacks on transport facilities such as airports, metro and subway systems, bridges, and
tunnels. The Dynamic Bayesian Network model is built on the premise of uncertainty:
it is dynamic and ready to handle uncertainty at all times. Manoj [14] describes a
situation in which a passenger at a given airport is suspected of being a terrorist;
intelligence is gathered about the passenger, possibly followed by detention and
interrogation, and after two to three days of investigation it is discovered that the
person is not a terrorist. The security system may then be relaxed, and the model
restarts the process and gives another prediction, taking the previous failure into
account. An example is the multiple threats to the New York subway system, which
indicate that a DBN may be capable of updating the threat level over time as new
information becomes available.
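
The idea of updating a threat level as new information arrives can be sketched with the simplest possible dynamic Bayesian network: a two-state (low/high threat) hidden variable that is filtered over time. All probabilities below are made-up illustrative values, and the function names are hypothetical; this is not the TAPM of [14], only a minimal sketch of the predict-then-update cycle such a model relies on.

```python
def update_threat(prior_high, alert_observed,
                  p_stay_high=0.8, p_become_high=0.1,
                  p_alert_given_high=0.7, p_alert_given_low=0.2):
    """One filtering step for a two-state threat model.

    prior_high is the current belief that the threat level is high.
    All transition and observation probabilities are illustrative assumptions.
    """
    # Prediction: how the threat level may evolve before new evidence arrives.
    predicted = prior_high * p_stay_high + (1.0 - prior_high) * p_become_high

    # Likelihood of the observation under each hidden state.
    if alert_observed:
        like_high, like_low = p_alert_given_high, p_alert_given_low
    else:
        like_high, like_low = 1.0 - p_alert_given_high, 1.0 - p_alert_given_low

    # Bayes' rule: posterior belief that the threat level is high.
    numerator = like_high * predicted
    return numerator / (numerator + like_low * (1.0 - predicted))


def filter_threat(observations, prior_high=0.1):
    """Run the update over a sequence of alert / no-alert observations."""
    beliefs = []
    belief = prior_high
    for alert in observations:
        belief = update_threat(belief, alert)
        beliefs.append(belief)
    return beliefs


if __name__ == "__main__":
    # Two alerts followed by two quiet periods (e.g. a suspect is cleared).
    for step, belief in enumerate(filter_threat([True, True, False, False]), 1):
        print(f"step {step}: P(high threat) = {belief:.3f}")
```

In this sketch the belief rises while alerts keep arriving and falls again once the evidence points the other way, mirroring the airport example in which the system relaxes after a suspect is cleared.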


4.4 Terrorist Group Prediction Model (TGPM) 


The Terrorist Group Prediction Model (TGPM), or Predicting the Perpetrator Model (PPM),
is one of the best models for predicting which group is likely to be responsible for an
attack (Fig. 1). When we predict which group is responsible for an attack,
countermeasures can be developed based on the group's history. Terrorist behavior
changes in response to counterterrorism, and attackers copy tactics from other terrorist
groups based on their success rate: if a terrorist group T successfully carries out an
attack on target x, the weapon type and approach used will likely be employed by
another terrorist group. This approach therefore studies the behavior of terrorist
groups and predicts which group is responsible for an attack. The TGPM, championed by
Abishek Sachan and Devshri Roy [2], uses the similar features of past terrorist attacks
to forecast the perpetrator or the group responsible. They employ concepts such as the
Group Detection Model (GDM), the Crime Prediction Model and the Offender Group
Detection Model (OGDM). The study focuses on India as a case study and uses a dataset
from the Global Terrorism Database (GTD), which includes the history of terrorist
attacks in India from 1998 to 2008. The method takes input data such as attack type,
location of the attack, suicide attack, weapon type and target type, which are used to
calculate the group weighted point. The percentage of attacks by each group is
calculated. Clusters are created based on the percentage of attacks of each group and
the parameter weights. The association of the input data with the formed clusters is
calculated, and the highest value is chosen. The group whose name matches the cluster
with the highest association value is predicted to be the group responsible for the
attack. The results show an accuracy rate of 80.41%.

The main limitation of this model is that it is reactive rather than proactive:
predicting which group is responsible for an action that has already taken place, rather
than preventing the action, yields very little fruit. In addition, today's terrorism has
gained such enormous media attention that a group will quickly claim responsibility once
an attack is carried out, without waiting for a predictor, which makes such a model
largely redundant. Moreover, since the model relies heavily on historical data, it
becomes more difficult to identify smaller groups with little history should they carry
out an attack using techniques and patterns that match those of a larger, known
terrorist organization. Lastly, the model is not dynamic, whereas terrorist actions
evolve over time. For instance, if the predictor uses weapon type and attack type as
features and a terrorist group adopts a new weapon and attack strategy not found in its
history, the model will not be accurate. Nonetheless, the TGPM remains a good source of
knowledge in counterterrorism research.
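
The "highest association value wins" step can be illustrated with a deliberately simplified sketch. The attack history, the group names, the feature weights and the scoring rule (a weighted share of a group's past attacks that match each feature of the new attack) are all hypothetical; the clustering stage of [2] is not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical attack history: (group, features of the attack).
HISTORY = [
    ("Group A", {"attack_type": "bombing", "weapon": "explosives", "target": "transport"}),
    ("Group A", {"attack_type": "bombing", "weapon": "explosives", "target": "police"}),
    ("Group A", {"attack_type": "armed assault", "weapon": "firearms", "target": "police"}),
    ("Group B", {"attack_type": "armed assault", "weapon": "firearms", "target": "civilians"}),
    ("Group B", {"attack_type": "kidnapping", "weapon": "firearms", "target": "civilians"}),
]
# Hypothetical parameter weights for each feature.
FEATURE_WEIGHTS = {"attack_type": 0.4, "weapon": 0.3, "target": 0.3}


def build_profiles(history):
    """Count, per group, how often each feature value occurred in past attacks."""
    counts = defaultdict(lambda: defaultdict(Counter))
    totals = Counter()
    for group, attack in history:
        totals[group] += 1
        for feature, value in attack.items():
            counts[group][feature][value] += 1
    return counts, totals


def association_scores(attack, history=HISTORY, weights=FEATURE_WEIGHTS):
    """Weighted share of each group's past attacks that match the new attack."""
    counts, totals = build_profiles(history)
    scores = {}
    for group, n_attacks in totals.items():
        score = 0.0
        for feature, weight in weights.items():
            share = counts[group][feature][attack[feature]] / n_attacks
            score += weight * share
        scores[group] = score
    return scores


def predict_group(attack):
    """Return the group with the highest association value."""
    scores = association_scores(attack)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    new_attack = {"attack_type": "bombing", "weapon": "explosives", "target": "transport"}
    print(association_scores(new_attack))
    print(predict_group(new_attack))
```

The sketch also makes the limitations above concrete: a feature value never seen in any group's history contributes nothing to any score, so a new weapon or tactic leaves the predictor without evidence.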


4.5 Mathematical Models for Understanding Radicalization and Terrorism (MMURT)


Chuang and Maria [3] proposed Mathematical Models for Understanding Radicalization
and Terrorism (MMURT). The increase in terrorism is a result of radicalization, and
therefore understanding how and why people become radicalized is a necessary step in
preventing terrorism. The approach is based on the notion that everyone has an opinion
and that the beliefs of an individual in a given society can change or be influenced
through peer-to-peer contact or through the media [3]. Radicalization usually takes
place through networks of peers and can be enhanced through the internet or other
technological means, mostly in the form of web-based recruitment material or chat rooms.
ISIS favors this style and often relies on an internet-savvy group that uses social media
to recruit Western foreign fighters for Syria and Iraq [3].
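
The notion that an individual's opinion shifts through peer contact and media exposure can be sketched with a generic averaging model of opinion dynamics. This is an illustrative assumption, not one of the specific models reviewed in [3]: opinions are numbers in [0, 1], and `peer_weight`, `media_weight` and the small network below are hypothetical.

```python
def step_opinions(opinions, neighbors, media_opinion,
                  peer_weight=0.3, media_weight=0.1):
    """One synchronous update: each opinion moves toward peers and the media.

    opinions      -- list of values in [0, 1], one per individual
    neighbors     -- neighbors[i] is the list of peers that influence individual i
    media_opinion -- the opinion pushed by the media (or recruitment material)
    """
    own_weight = 1.0 - peer_weight - media_weight
    updated = []
    for i, own in enumerate(opinions):
        peers = neighbors[i]
        # An individual without peers is influenced only by the media.
        peer_mean = sum(opinions[j] for j in peers) / len(peers) if peers else own
        updated.append(own_weight * own
                       + peer_weight * peer_mean
                       + media_weight * media_opinion)
    return updated


def simulate(opinions, neighbors, media_opinion, steps):
    """Apply the update rule repeatedly and return the final opinions."""
    for _ in range(steps):
        opinions = step_opinions(opinions, neighbors, media_opinion)
    return opinions


if __name__ == "__main__":
    start = [0.0, 0.5, 1.0]          # moderate, undecided, radical
    peers = [[1], [0, 2], [1]]       # a chain of three individuals
    print(simulate(start, peers, media_opinion=1.0, steps=10))
```

Under these assumptions, a persistent one-sided media signal gradually pulls the whole network toward that opinion, which is the qualitative effect attributed to web-based recruitment material above.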
Fig. 1. Terrorist Group Prediction Model (TGPM). Source: [2]


4.6 Hawkes Process Modeling 


The Hawkes process is useful for predicting event occurrences over the short term.
Developed in 1971, the Hawkes self-exciting point process model holds that events are
self-exciting rather than independent of one another: the occurrence of an event
triggers further occurrences, i.e. a given event raises the chances of another event in
the future [14]. Although it was developed to predict the possible occurrence of
earthquakes, Tench (2018) used it as a machine learning approach to countering
terrorism by predicting terrorist acts, as in the case of the Irish Republican Army
(IRA). Terrorist actions evolve and tend to be repeated where they are successful. As
in the Hawkes process model, when a terrorist successfully carries out an attack, the
success generates excitement, and such excitement increases the chances of further
attacks within a short period.
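
The self-exciting behavior can be made concrete with the conditional intensity of a Hawkes process. A common textbook form, assumed here for illustration, uses an exponentially decaying kernel: λ(t) = μ + Σ α·exp(−β(t − t_i)), summed over past events t_i < t. The parameter values `mu`, `alpha` and `beta` and the event times below are hypothetical and are not taken from Tench (2018).

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a Hawkes process with an exponential kernel.

    mu    -- background rate of events
    alpha -- jump in intensity caused by each past event
    beta  -- decay rate of that excitation
    Only events strictly before time t contribute.
    """
    excitation = sum(alpha * math.exp(-beta * (t - ti))
                     for ti in event_times if ti < t)
    return mu + excitation


if __name__ == "__main__":
    attacks = [1.0, 1.5, 4.0]   # hypothetical attack times (e.g. in weeks)
    for t in [0.5, 1.1, 1.6, 3.9, 4.1, 10.0]:
        print(f"t = {t:4.1f}  intensity = {hawkes_intensity(t, attacks):.3f}")
```

The intensity jumps right after each event and decays back toward the background rate `mu`, which is the sense in which one successful attack raises the short-term chance of another.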


5 Conclusions and Future Works 


Determining the likelihood that a terrorist may carry out an attack is a crucial step in
any counterterrorism strategy. States put countless efforts into securing their
territorial boundaries from external aggression (aggression from another state).
However, terrorism poses a new form of security challenge, and global efforts at
combating it face a potent challenge. Our study is therefore important not only for
academia but also for security experts and policymakers seeking ways of combating and
preventing terrorism. Most information about terrorist activities is obtained from the
media and from terrorism database sites such as the Global Terrorism Database. So far,
no survey has been found on models for predicting terrorism, which makes it more
difficult to compare previous studies. Furthermore, most of the proposed models did not
focus on a specific case study, and hence further work is needed to compare several
terrorist groups case by case. One study applied the Hawkes process to the IRA terrorist
group and was successful, but the model did not take into account the goals of the
group; future work therefore needs to take into consideration the types of terrorist
groups and their goals. The IRA, for instance, had as its main objective an autonomous
state of Ireland, and its actions became dormant when Ireland became an independent
state, whereas other groups such as ISIS have wider goals, such as installing a global
caliphate and making the world an Islamic one, so finding measures for tackling them is
even more difficult.
Abstract. Nowadays, cyber-attacks are targeting mobile devices, bank
accounts, connected vehicles and cyber-physical systems. These attacks 
are becoming more complex and are raising safety problems when tar- 
geting physical environment. An efficient way to protect against these 
attacks is making several security actors collaborate in defining appro- 
priate countermeasures. However, in practice, security actors refrain from 
collaborating to avoid sharing their proprietary security processes. These 
processes represent a critical knowledge as they reflect these actors brand 
images. In this work, we investigate the use of homomorphic encryption 
to define a privacy preserving framework for sharing processes between 
different cybersecurity actors and for providing confidential data analy- 
sis. We describe a high level design for a secure cloud platform managing 
encrypted data. The data analysis algorithms provided by the cloud plat- 
form are designed with our open source tool Cingulata, which enables 
designers to implement any data analysis function, compile it and run it 
on homomorphically encrypted data. 


Keywords: Homomorphic encryption · Private data sharing agreements ·
European data economy · Secure data exchange · Platform security · GDPR


1 Introduction 


A survey by the European Union Agency for Network and Information Security
(ENISA) [14] affirms that three-quarters of businesses faced cybersecurity
issues during the last decade. The majority of respondents believed that their
organisations had been victims of targeted attacks, and almost a third of them
reported a significant business impact. In addition, the advent of the Internet of
Things (IoT) has raised the number of connected devices and so increased the attack
surface of many IT systems. The increased number of attacks has resulted in great
demand for security protection mechanisms. These mechanisms in turn require
collecting data, whether for incident reporting, attack prediction or improving system
protection.

Current IT services generate an increasing amount of data about a wide range of
events. Events come from various origins, such as system logs for access control,
hardware traps raised by hypervisors, and application audit trails that identify
business transactions. The diverse nature of these events, together with the 5V¹ storage
challenges of big data, makes it difficult to perform either real-time detection
of exceptional events, aimed at promptly taking corrective action, or deep and
narrow historical data analysis, aimed at identifying warning signs of security
threats.

In our work with the C3ISP [4] and KONFIDO [20] projects, we specify a platform
that provides an efficient and flexible framework for secure data analytics, where
data access and data analysis operations are regulated by multi-stakeholder
data-sharing agreements.

The rest of this paper is structured as follows. In Sect. 2, we evaluate the
benefits of developing a flexible framework that allows confidential and collaborative
information sharing and analysis among relevant security stakeholders.
Based on the architecture of the C3ISP framework, we then propose in Sect. 3 a design
for a privacy-aware cloud platform with support for homomorphic encryption.
We evaluate this platform via an industrial use case detailed in
Sect. 4. Finally, Sect. 5 concludes the paper by presenting the limitations of
this work and its future enhancements.


2 Information Sharing and Privacy-Aware Design 
Principles 


Critical infrastructures such as energy, water and nuclear facilities receive increased
attention globally because of their importance to our society and to national defense.
Fortunately, thanks to increased awareness and attention from government agencies
and company management in these domains, the security of these systems is
improving. However, substantial obstacles still hinder information
sharing between these companies, the government and regulatory authorities, owing
to the critical nature of the data and to sovereignty issues.

In this section, we summarize some of the actions taken to date - with empha- 
sis on the European Union experience - to encourage and facilitate improved 
information sharing. 


2.1 Benefits of Information Sharing in Mitigating Cyber 
and Physical Threats 


A simple way to look at information sharing is as data flows between government,
private business and citizens. Essentially, optimal information sharing
allows an open flow of ideas and concerns from the government to businesses
and citizens, and equivalent flows of information from businesses and private citizens
back to the government. The flow of information between
citizens and businesses is, of course, also desired. Unfortunately, the open and unimpeded
flow of information between these institutions and individuals can
be slowed, or even blocked, by a variety of challenges such as politics, intellectual
property protection, legal concerns, sovereignty and ultimately a lack of
trust.

¹ 5V big data challenges: Data Volume, Velocity, Variety, Veracity and Value.

Sharing information offers a significant advantage in terms of increasing
the capability to detect and prevent attacks. In recent years, a number of
security attacks with very serious consequences have been carried out. Notable
examples are the Flame and Conficker malware [8], which caused losses of several
millions of dollars. These attacks were successfully tackled thanks to
collaboration among security and business companies. These companies shared
relevant information, whose collaborative analysis was vital in identifying
the features needed to prevent future attacks of this kind. Indeed, as also noted by
the activities of the EU Network and Information Security (NIS) WG2
on Information Sharing [19], there are several benefits in information sharing for
cybersecurity (including incident notification) as well as several barriers to be
removed. Finally, the report on Cybersecurity Policies and Critical Infrastructure
Protection [18] confirms that sharing information on security threats between
security actors is the best way to strengthen defences against cyberattacks.

Some benefits of information sharing for cybersecurity are: 


• Faster attack prediction and reaction
• Collaborative threat analysis
• Increased knowledge of how to thwart attacks


However, there are also several barriers to sharing cybersecurity data. Some of
these barriers relate to the provision of proper access control mechanisms
for cybersecurity data [19]. Indeed, organizations fear damage to their
brand image and the loss of their reputation if the shared information publicly
reveals their cybersecurity incidents. There are also concerns about
compliance with legislation such as the GDPR, which restricts access to
personal information.

Some barriers to cybersecurity information sharing are:


• Lack of trust in the entities sharing data
• Lack of such systems as a commodity
• Lack of control over shared data


In our work, we aim to develop a technological and procedural framework
that unleashes the power of information sharing for collaborative analytics for cyber
protection in a confidential manner. The framework allows data producers/consumers
to easily express their preferences on how to share their data, which
operations can be performed on such data, with whom the resulting data
can be shared, etc. This entails a framework that combines several technologies
for expressing and enforcing data sharing agreements with technologies
for performing data analytics operations in a way that complies with these agreements.
Among these technologies, we can mention data-centric policy enforcement
mechanisms and data analysis operations performed directly on encrypted data
provided by multiple producers/consumers, or prosumers.


2.2 Privacy-Aware Platform Design


Privacy-aware platform design consists of defining component patterns
and criteria for approaching privacy and data collection respectfully.
A sound way of exploring data makes it possible to detect unauthorized data requests
and malicious third-party tracking, and to handle the offboarding experience. A useful
reference to read and apply in practice is the set of Fair Information Practices
proposed by the Department of Health, Education, and Welfare (HEW) in a
1973 study entitled Records, Computers, and the Rights of Citizens [16]. The
resulting recommendations are the following:


1. Provide a full description of the purpose of data collection and exploitation: a full
description of data usage is considered a critical element of public data
collection, and took the form of a sharing consent form in the Fair Information Practices.
This consent form lets prosumers know, transparently, how their data will be used
and the constraints of the privacy policy under which it
was collected, such as the duration and purpose of data exploitation, the storage measures
taken for data protection, and the analysis model with which
customer data will be treated. From the point of view of legal implications,
privacy-related architecture design needs to clarify questions
of legal capacity and to provide an adequate, simple understanding of the architecture
design, in order to highlight any residual and legal risks [10] during an architecture
audit. The sharing consent form also serves a pedagogical purpose:
it makes users aware that data is being collected and used.

2. Minimal and sufficient data collection for a specific analysis model: "personal
data" can reveal the behavior, thoughts and/or preferences of
an individual. It is very important for product development, but very sensitive
in the development and testing phases. Any data used in these phases
needs to be anonymized or homomorphically encrypted beforehand.

3. Minimal identification of data with individuals: in the telco use case,
360-degree customer management software sometimes requires collecting
customer information such as user equipment, user position (GPS) and user
behavior when surfing the web. At this critical point, when planning a
privacy-aware platform design, an independent and isolated authentication and
identification service with an authorization authority should be planned;
it could be necessary and sufficient for future data processing systems.

4. Minimize and secure data retention: data usage needs to be directly related
to a feature of an application. If data is retained and stored, then it must
be protected in an auditable, secure way. The minimum security requirement
is to store data with encryption technology, so that data disclosure
is difficult or impossible without the decryption process. In addition, proof of
data non-reusability is always required, which assures customers that stored data
cannot be reused.


3 Privacy Preserving Computation on Cloud Platforms 


In this section, we describe the basic components of a secure cloud platform with
distributed constraints.


3.1 Secure Content-Based Routing 


Secure Content-Based Routing (CBR) is a well-known event-driven architecture
that supports communication between machine processors or micro-services
on a distributed cloud platform. This design, described at a high level
in [13,18], routes messages based on their content rather than their destination,
which reduces the coupling between micro-services or other
functionalities in a cloud application. Its first principle is to
favor the decoupling of data production from consumption: synchronisation decoupling, by
maximizing asynchronous and anonymous communication; space decoupling, where
data producers and data consumers do not know each other; and time decoupling, where
production and consumption happen at different times. The second principle is to allow
scalability in terms of message volume per minute, data volume per second and connection
volume (producers and consumers) at a given instant. In the context of cloud application
development, this kind of technique is also known as distributed event-based
systems, distributed publish-subscribe systems, distributed messaging services,
message-oriented middleware or active databases.
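
As a rough illustration of routing on content rather than destination, the following minimal in-memory sketch lets subscribers register predicates over event fields. The class `ContentBasedRouter`, its methods and the event fields are hypothetical names chosen for this example; they are not part of the platform described here. The sketch is synchronous, so it only shows space decoupling, not time decoupling.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Predicate = Callable[[Event], bool]
Handler = Callable[[Event], None]


class ContentBasedRouter:
    """Toy broker: subscribers register predicates over event content."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[Predicate, Handler]] = []

    def subscribe(self, predicate: Predicate, handler: Handler) -> None:
        # The subscriber never names a producer (space decoupling).
        self._subscriptions.append((predicate, handler))

    def publish(self, event: Event) -> int:
        # The producer never names a consumer; delivery depends only on content.
        delivered = 0
        for predicate, handler in self._subscriptions:
            if predicate(event):
                handler(event)
                delivered += 1
        return delivered


router = ContentBasedRouter()
ssh_alerts: List[Event] = []
high_severity: List[Event] = []

# Two independent consumers, each interested in a different property of the event.
router.subscribe(lambda e: e.get("dpt") == 22, ssh_alerts.append)
router.subscribe(lambda e: e.get("severity", 0) >= 5, high_severity.append)

router.publish({"src": "192.168.1.31", "dpt": 22, "severity": 3})
router.publish({"src": "192.168.1.40", "dpt": 443, "severity": 7})
```

A production system would place a persistent, partitioned log between `publish` and the handlers so that consumers can read events later and at their own pace.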

To secure data communication between producer (publisher) and consumer
(subscriber), the communication mechanism described in [15] can be improved
by adding an extra security layer to the process. This layer
reduces the threat to data confidentiality and integrity. Moreover, using an SSL
connection with strong authentication may offer a secure protocol for exchanging
confidential data such as cryptographic keys between both ends of the communication
chain. As publications and subscriptions are encrypted and signed, it
becomes easier to detect a malicious attack or corrupted data. Figure 1 presents
a secure event-driven architecture.
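
The tamper-detection idea can be sketched with a keyed message authentication code from the Python standard library. This is an assumption made for illustration only: the mechanism of [15] is not reproduced here, the function names and the shared key are hypothetical, and an HMAC (being symmetric) detects modification but does not by itself give the non-repudiation that a public-key signature would.

```python
import hashlib
import hmac
import json


def sign_publication(key: bytes, event: dict) -> dict:
    """Serialise the event deterministically and attach an HMAC-SHA256 tag."""
    payload = json.dumps(event, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_publication(key: bytes, message: dict) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(
        key, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


# Hypothetical key, assumed to have been exchanged over the authenticated channel.
shared_key = b"demo-key-exchanged-over-tls"

message = sign_publication(shared_key, {"src": "192.168.1.31", "dpt": 22})

# An attacker rewrites the destination port but cannot recompute the tag.
tampered = dict(message, payload=message["payload"].replace("22", "80"))
```

Here `verify_publication(shared_key, message)` accepts the original publication, while the tampered copy is rejected.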


3.2 Secure Data Access Control 


The standard security requirements of confidentiality, integrity and availability 
(CIA-triad) are at the core of our data-centric protection vision. In addition 
to these traditional security requirements, also non-repudiation, authentication, 
authorisation, and accountability are among the top priorities of private data 
sharing. 


Fig. 1. Multiple producers send messages to a partitioned log, consumed by multiple
consumer groups.


1. Confidentiality: the property of protecting the secrecy of data and only disclosing
it to authorised parties under the policies specified in their Data Sharing
Agreements (DSAs). Encryption has traditionally been used for
preserving data confidentiality in transit and at rest, and only recently has homomorphic
encryption started to emerge as a novel approach for ensuring
data confidentiality in use;

2. Integrity: the property of protecting the data from fraudulent modification;

3. Availability: the capability to have the data available when they are requested
(this is important for a quick reaction to attacks);

4. Non-repudiation: the fact that a party cannot deny having submitted
data acts as a deterrent that discourages a malicious party from
submitting bad data;

5. Authentication and authorisation: both the data sharing and the consumption
of analytics results have to be protected by access control mechanisms and
conditions. DSAs regulate how access and usage control protect the cyber
threat information;

6. Accountability: to address mandatory compliance requirements,
to help internal investigations, to assess the correctness of system processing,
etc., our platform has to be able to trace and identify the entities or
people that participate in the DSA-regulated federation and to verify
that the stated policies have been correctly and effectively enforced.


3.3 Homomorphic Encryption-Based Cloud Analysis Platform


Homomorphic encryption (HE) is an encryption method that allows
computation on encrypted data without decrypting it. Such schemes are known
to be very useful for constructing privacy-preserving protocols, even in their classical
versions. For example, homomorphic encryption has been a key tool
in the popularisation of electronic voting schemes. Another application of
homomorphic encryption is Private Information Retrieval, a communication-efficient
interactive protocol that allows a user to retrieve an item
from a database without revealing which item is being retrieved. This paradigm
has found applications in numerous contexts: private searching,
keyword search, private storage, anonymous authentication, etc. Another very
popular scenario that benefits from homomorphic encryption is cloud
computing: a user relies on computing resources from a cloud provider
to perform expensive computation on sensitive data. These scenarios have in
common that fully homomorphic encryption (FHE) is used to
scramble the data in order to protect their confidentiality while algorithms are
executed on the encrypted data. In real-world applications using FHE,
one or several entities interact with the cloud. To preserve the privacy of each user,
the data are sent to the cloud in encrypted form. The service provider processes the
received data homomorphically and sends the encrypted result to an end user
(who owns the FHE parameters and hence the secret key). The end user decrypts
the result using their own decryption key. Here, the service provider can compute
almost any function over the encrypted data and acts transparently with respect
to each entity, using only public information and encrypted data (Fig. 2).


Fig. 2. Homomorphic encryption allows computation on encrypted data without
decrypting it in an untrusted environment: the service provider computes R = f(X, Y)
without ever decrypting X or Y.
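
To make "computing R = f(X, Y) without decrypting X or Y" concrete, the sketch below uses a toy Paillier cryptosystem. This is a deliberate substitution: Paillier is only additively homomorphic (f is limited to addition here), it is not one of the FHE schemes discussed in this paper, and the primes are far too small to be secure. All function names are illustrative.

```python
import math
import random


def generate_keys(p: int, q: int):
    """Build a toy Paillier key pair from two distinct primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Enc(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n_squared = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n: int, private_key, ciphertext: int) -> int:
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Run by the service provider: multiplying ciphertexts adds plaintexts."""
    return (c1 * c2) % (n * n)


# The end user owns the key pair; the provider only ever sees n and ciphertexts.
n, private_key = generate_keys(293, 433)
enc_x = encrypt(n, 1200)
enc_y = encrypt(n, 34)
enc_r = add_encrypted(n, enc_x, enc_y)
result = decrypt(n, private_key, enc_r)
```

The provider calls only `add_encrypted`; decryption of `enc_r` by the key owner yields the sum of the two plaintexts.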


The underlying mathematical objects used to build fully homomorphic
encryption schemes are Euclidean lattices. The security of almost all known FHE
constructions relies on the problem of finding a short vector or basis in a
high-dimensional lattice. Gentry's solution relies on ideal lattices over algebraic number
fields. In 2012, Brakerski, Gentry and Vaikuntanathan (BGV) [2] improved
this scheme without using bootstrapping; they proposed a generalized construction
that is secure under the popular Learning With Errors assumption and its ring
variant. Brakerski [1] then proposed a new scale-invariant scheme that does
not require modulus switching. In 2012, Fan and Vercauteren (FV) [6] proposed
a ring variant of this scheme and improved its efficiency. The BGV and FV
cryptosystems are already implemented in Cingulata 1.0 [17]. The
third generation of FHE, with fast bootstrapping techniques, called TFHE (Fast
Fully Homomorphic Encryption over the Torus) and based on [5,9], has been available in
Cingulata 2.0 since June 2019.

Our open source Cingulata tool [3] offers a compiler chain for high-level language
development targeting HE execution, based on manipulating Boolean circuits.
A Boolean circuit is a directed graph G = (V, A) whose vertices are either inputs, outputs
or operators (XOR, AND) and whose arcs correspond to data transfers. The
following constraints are imposed on a compiler targeting HE execution [7,12]:


- No ifs (unless regularized by conditional assignment).
- No data-dependent loop termination (upper bounds are needed).
- Array dereferencing/assignment in O(n) (vs. O(1)).
- Algorithms always realize (at least) their worst-case complexity!
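
The constraints listed above can be mimicked on plain integers, as in the following sketch. It is not Cingulata code: the function names are hypothetical, and the 0/1 integers merely stand in for encrypted bits whose values the executing party cannot inspect, which is why every branch, array slot and loop iteration has to be evaluated.

```python
def select(cond: int, if_true: int, if_false: int) -> int:
    """Conditional assignment replacing an if: cond must be 0 or 1."""
    return cond * if_true + (1 - cond) * if_false


def oblivious_lookup(table, index):
    """Read table[index] by touching every slot: O(n) instead of O(1)."""
    result = 0
    for position, value in enumerate(table):
        is_match = int(position == index)  # stands in for an encrypted equality bit
        result = select(is_match, value, result)
    return result


def bounded_count_until_zero(values, max_steps):
    """Count the leading non-zero items using a fixed upper bound of iterations.

    The loop cannot stop early on a secret condition, so it always runs
    max_steps times and masks the updates once a zero has been seen.
    """
    count = 0
    still_running = 1
    for step in range(max_steps):
        item = values[step] if step < len(values) else 0  # public bound only
        still_running = still_running * int(item != 0)
        count = count + still_running
    return count
```

Both helpers always perform their worst-case amount of work, which is the price of never branching on encrypted data.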


In terms of design, this compilation chain is composed of three layers: a
front-end, a middle-end and a back-end. The front-end transforms code written
in C++ into its Boolean circuit representation. The middle-end layer optimizes
the Boolean circuit produced by the front-end. The back-end homomorphically
executes the Boolean circuit over encrypted data. Two HE libraries are supported
by our Cingulata compiler: (i) an in-house implementation of [6] and (ii) the
publicly available TFHE library.
A simple “hello world” example written using Cingulata is: 


CiInt a{CiInt::u8};       // create an unsigned 8-bit variable
CiInt b{CiInt::u8v(42)};  // use helper function to create
                          // an unsigned 8-bit
CiInt c{-1, 16, false};   // or manually specify value,
                          // size and signedness

a.read("a");              // read variable a and b
b.read("b");


Using the FV cryptosystem this program is homomorphically executed in under 
5s and using TFHE in under 1s. 
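
To give an idea of the Boolean-circuit form that such a program is lowered to, the sketch below builds an 8-bit adder from XOR and AND gates only, evaluated here on clear bits. It is an illustration under assumptions: the gate functions and `add_u8` are hypothetical names, and the circuit actually emitted and optimised by Cingulata may differ.

```python
def xor(a: int, b: int) -> int:
    return a ^ b


def and_(a: int, b: int) -> int:
    return a & b


def full_adder(a: int, b: int, carry_in: int):
    """One-bit adder using only XOR and AND operators."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    # carry = (a AND b) OR (carry_in AND (a XOR b)); the two terms are never
    # both 1, so the OR can be written as an XOR.
    carry_out = xor(and_(a, b), and_(carry_in, partial))
    return total, carry_out


def to_bits(value: int, width: int):
    """Little-endian bit decomposition."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def add_u8(x: int, y: int) -> int:
    """Ripple-carry addition of two unsigned 8-bit values (wraps modulo 256)."""
    carry = 0
    out = []
    for a, b in zip(to_bits(x, 8), to_bits(y, 8)):
        bit, carry = full_adder(a, b, carry)
        out.append(bit)
    return from_bits(out)
```

In a homomorphic back-end each `xor` and `and_` call would be replaced by the corresponding gate on encrypted bits; AND gates are typically the costly ones, which is what circuit optimisation tries to reduce.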


4 Use Case: A Privacy Preserving Detection of Brute 
Force Attacks 


This analytics service works on connection request logs and identifies whether the
destination addresses belong to malicious hosts. Connection logs are taken directly
from a router that uses the NetFlow v9 protocol and pushes the information to
client software that collects the logs. The service combines
a Data Manipulation Operations component with the homomorphic
encryption operation that follows it. Log data is collected in the Common Event Format
(CEF) [11], an extensible, text-based format designed to support multiple device types by
offering the most relevant information. Message syntaxes are reduced to work
with ESM normalization. Specifically, CEF defines a syntax for log records comprising
a standard header and a variable extension, formatted as key-value
pairs.

The following example illustrates a CEF message sent over a Syslog connection:


20 01:25:12 host
CEF:0|Router_Vendor|Router_CED|1.0|100|Connection
Detected|5|src=192.168.1.31 spt=22126 dst=214.141.161.177
dpt=24920 proto=TCP end=1505462161000 dtz=EuropeBerlin
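
A minimal parser for such a record could look as follows. It is a sketch rather than the component used in the platform: `parse_cef` and the dictionary keys are names chosen for this example, and escaped characters in CEF (for instance `\|` in the header or `\=` in the extension) are deliberately not handled.

```python
import re

# The seven pipe-separated CEF header fields that precede the extension.
CEF_HEADER_FIELDS = [
    "version",
    "device_vendor",
    "device_product",
    "device_version",
    "signature_id",
    "name",
    "severity",
]


def parse_cef(line: str) -> dict:
    """Split a CEF record into its standard header and key-value extension."""
    start = line.index("CEF:")  # skip the Syslog prefix (timestamp, host)
    parts = line[start + len("CEF:"):].split("|", 7)
    if len(parts) != 8:
        raise ValueError("expected 7 header fields followed by an extension")
    record = dict(zip(CEF_HEADER_FIELDS, (part.strip() for part in parts[:7])))
    # A value runs until the next " key=" token or the end of the line.
    pairs = re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", parts[7])
    record["extension"] = {key: value.strip() for key, value in pairs}
    return record


sample = (
    "20 01:25:12 host CEF:0|Router_Vendor|Router_CED|1.0|100|"
    "Connection Detected|5|src=192.168.1.31 spt=22126 "
    "dst=214.141.161.177 dpt=24920 proto=TCP "
    "end=1505462161000 dtz=EuropeBerlin"
)
record = parse_cef(sample)
```

The destination address needed by the blacklist analysis is then available as `record["extension"]["dst"]`.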


To keep FHE analysis practical, we propose four configurations that trade off the
security level against performance:


- BLACK_LIST_FULL
- BLACK_LIST_HIGH
- BLACK_LIST_MEDIUM
- BLACK_LIST_LOW


They establish a trade-off between security and efficiency by
encrypting the last i = 4, 3, 2 or 1 byte(s) of each IPv4 address, respectively. Take,
for example, an IPv4 address of the form a.b.c.d. We write [x]_fhe
to denote the data x encrypted with FHE.


- BLACK_LIST_FULL   : [a]_fhe.[b]_fhe.[c]_fhe.[d]_fhe
- BLACK_LIST_HIGH   : a.[b]_fhe.[c]_fhe.[d]_fhe
- BLACK_LIST_MEDIUM : a.b.[c]_fhe.[d]_fhe
- BLACK_LIST_LOW    : a.b.c.[d]_fhe
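
The four configurations amount to choosing how many trailing bytes of the address are hidden. The sketch below only performs that split and prints the notation used above; no encryption takes place, and `split_ipv4` and `describe` are hypothetical helper names.

```python
# Number of trailing IPv4 bytes that each configuration encrypts.
ENCRYPTED_BYTES = {
    "BLACK_LIST_FULL": 4,
    "BLACK_LIST_HIGH": 3,
    "BLACK_LIST_MEDIUM": 2,
    "BLACK_LIST_LOW": 1,
}


def split_ipv4(address: str, configuration: str):
    """Return (clear prefix bytes, bytes to be encrypted) for an address a.b.c.d."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError("not a valid IPv4 address: " + address)
    encrypted_count = ENCRYPTED_BYTES[configuration]
    return octets[: 4 - encrypted_count], octets[4 - encrypted_count:]


def describe(address: str, configuration: str) -> str:
    """Render the address with encrypted bytes marked as [x]_fhe."""
    clear, secret = split_ipv4(address, configuration)
    parts = [str(octet) for octet in clear]
    parts += ["[" + str(octet) + "]_fhe" for octet in secret]
    return ".".join(parts)
```

For instance, `describe("214.141.161.177", "BLACK_LIST_LOW")` leaves the first three bytes in clear and marks only the last one as encrypted, so the homomorphic comparison has to cover a single byte.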


We used a compute server W3520 @2.67 GHz with 128 GB of RAM and a 48-core
CPU. The result is a 32-bit value: if it is positive, the IP
belongs to the blacklist; otherwise it does not. In terms of performance, when analysing a
list of 320 IPs with only 8 CPUs, each CPU handles
40 IPs, processed sequentially as 4 slots of 10 IPs. Running on all
48 cores with parallel processing, we can verify up to 1920 IPs (6 × 320 IPs)
per request in the same length of time.
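
The sizing above follows from simple arithmetic, reproduced here as a check. The variable names are illustrative, and the assumption that the work scales linearly with the number of cores is the one made in the text.

```python
ips_per_request = 320   # list size analysed in the 8-CPU run
cpus_used = 8
slot_size = 10          # IPs processed together in one slot
total_cores = 48

ips_per_cpu = ips_per_request // cpus_used             # IPs handled by each CPU
slots_per_cpu = ips_per_cpu // slot_size               # sequential slots per CPU
parallel_groups = total_cores // cpus_used             # independent 8-CPU groups
max_ips_same_latency = parallel_groups * ips_per_request
```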


FV. With the FV library option in Cingulata, the runtime is around 56 s and
115 s with the BLACK_LIST_LOW and BLACK_LIST_MEDIUM configurations respectively,
whereas it takes 170 s and 415 s with BLACK_LIST_HIGH and
BLACK_LIST_FULL.


TFHE. With the TFHE library option in Cingulata, the BLACK_LIST_FULL
and BLACK_LIST_HIGH configurations take 17 s and 13 s respectively,
whereas BLACK_LIST_MEDIUM returns the result after 8 s and
BLACK_LIST_LOW after only 4 s.

                     FV*                        TFHE*
                     320 IPs(a)   1920 IPs(b)   320 IPs(a)   1920 IPs(b)
BLACK_LIST_LOW       56 s         56 s          4 s          4 s
BLACK_LIST_MEDIUM    115 s        115 s         8 s          8 s
BLACK_LIST_HIGH      170 s        170 s         13 s         13 s
BLACK_LIST_FULL      415 s        415 s         17 s         17 s


*: a compute server W3520 @2.67 GHz, 128 GB RAM, with a 48-core CPU.
(a): the same compute server, using 8 CPUs.
(b): the same compute server, using 48 CPUs.


5 Conclusion 


In this paper, we have shown the benefits of information sharing in mitigating
cyber and physical threats. We have discussed privacy-aware design principles
covering user awareness when consenting to share data, the need
for design architects to be aware of data usage and data collection, and the
auditability of privacy-related design with regard to legal consequences. We
then described a high-level design for two critical components,
secure content-based routing and secure data access control, which are integrated
in our cloud-based analysis platform with homomorphic encryption
technology. We presented the Cingulata compilation tool-chain for fully homomorphic
encryption and concluded with the use case on detecting brute
force attacks. We showed the performance of Cingulata with the FV library and with TFHE,
the third generation of FHE technology, which is already integrated in our Cingulata
library and available as open source on GitHub. In comparison with our solution dedicated
to enterprise partners, which is currently in development, this open source
compiler still shows some constraints of usage, integration and implementation
for industrial use. However, the compiler design, which allows developers to work
with high-level languages, and the HE performance obtained with the TFHE library are
encouraging. In future work, we will improve the performance
of the open source Cingulata and define optimisations for Boolean circuits
that reduce memory consumption and execution time. We are also preparing
work to adapt Cingulata to Artificial Intelligence technology.

Abstract. This paper proposes a secure and robust stereo image watermarking
scheme based on chaotic logistic map encoding and frequency transformations.
Chaotic logistic map encoding is employed to confuse both the original stereo
images and the watermark images. Owing to the high sensitivity of the chaotic logistic
map to initial conditions, a huge key space is available for encoding
images. The proposed system is therefore well able to resist
brute-force and statistical attacks, as evaluated through various
analysis methods such as key sensitivity analysis and adjacent-pixel correlation
analysis. Moreover, the discrete
cosine transform (DCT) and the singular value decomposition (SVD) are exploited
to enhance the robustness of the watermarked image. Based on the properties of the SVD
and the advantages of the DCT, a watermark image is embedded in the singular values.
Performance evaluations show that the proposed watermarking scheme for stereo images is
highly secure as well as strongly robust against different kinds of attacks.


Keywords: Data hiding · Watermarking · DCT-SVD · Chaos · Logistic map


1 Introduction 


The explosion of computer networks and information technology in recent years has
led to a revolution in the field of communication. The transmission of information has
become more prominent in all activities. Every minute, a huge amount of digital
information is exchanged among different users through the Internet. However, digital
information is very different from conventional information: it has become dramatically
easier to copy, alter and distribute information without authorization. Thus, there is a
need for secure methods of sharing information. Depending on the application,
different security methods are considered. Digital watermarking is one of the most common
techniques and has been applied in copyright protection and authentication for
the last two decades [1-4].


© Springer Nature Switzerland AG 2020

In a digital image watermarking technique, a watermark is inserted into the cover
image to protect it from unlawful use. According to the watermark insertion process,
digital image watermarking techniques can be grouped into spatial domain techniques
and frequency domain techniques. In spatial domain algorithms, the watermark is
embedded directly in the cover image by changing pixel values [5-7]. The advantages
of spatial domain methods are easy implementation and low computational complexity,
but these methods are not robust enough to resist image processing attacks. In
contrast, in frequency domain approaches, pixel values are transformed into frequency
coefficients and the watermark is embedded in those coefficients. Many watermarking
schemes in the frequency domain have been proposed [8-13]. In
2010, Lu et al. [8] proposed a robust watermarking scheme in which a watermark is
inserted into the frequency coefficients after the discrete Fourier transform. In 2015,
AL-Nabhani et al. [10] used the discrete wavelet transform (DWT) to embed a binary
watermark image in selected coefficient blocks. There are also many hybrid transform
domain watermarking techniques, such as [9], that increase the embedding capacity;
its authors successfully combined the integer wavelet transform (IWT) with the Discrete
Gould transform (DGT) for medical images. In 2018, Kang et al. proposed a novel hybrid
of the discrete cosine transform (DCT) and singular value decomposition (SVD) in the DWT
domain [11] that offers high robustness and quality. In general, frequency domain
watermarking withstands image processing attacks better than spatial
domain watermarking.

According to this review of the state of the art, watermarking algorithms have mostly
been designed for mono images. Recently, with the advance of new technologies, many
three-dimensional (3D) videos and images have been generated. Such images are taken by two
cameras that are horizontally aligned and separated by an adjustable distance similar to
the distance between our eyes [14-16]. A stereo image is the stereoscopic 3D display
obtained by projecting the left and right views onto a specially designed screen.
Because they provide information about the 3D structure of scenes, stereo images are
used in various applications such as computer vision, medical surgery and autonomous
navigation [17]. As with other digital information, unauthorized users can
easily alter, copy and distribute stereo images over open networks
without the permission of the original authors [18]. Ensuring the accuracy and integrity
of stereo images is therefore important. Digital watermarking technology is currently
considered an effective solution for authentication and copyright protection.

There are several previous works in the field of stereo image digital watermarking [16,
19, 20]. In [19], the authors proposed stereo image coding based on the DCT and SVD.
The disparity extracted from the stereo pair is used as the watermark and is embedded in
the left view of the stereo image. The extracted watermark generally has good visual
quality, but the scheme is weak against attacks such as JPEG compression, cropping and
filtering. Zhou et al. [16] proposed a watermarking technique using a hierarchical tamper
detection strategy and stereoscopic matching in 3D multimedia, which performs well
for tamper detection and self-recovery in stereo images. In 2017, we proposed a robust
hybrid watermarking scheme for copyright protection of stereo images in the transform
domain [20]. Since the left and right views of a stereo image are not independent of each
other, the method matches a block in one view to the corresponding block in the other
view in the DCT domain. The watermark is then embedded in similar block pairs based on
the SVD, which yields more robust watermarking than other algorithms.

Most existing stereo image watermarking schemes concentrate on visual quality
and signal processing attacks, and consider only payload
and robustness. Consequently, if eavesdroppers know the watermarking scheme, they can
extract the watermark. Once this information is revealed, such schemes are no longer
secure [3]. To overcome this problem, a secure and robust watermarking scheme is proposed
in the present paper, in which the original image and the watermark image are encrypted
using the chaotic logistic map. In addition, to enhance the robustness of the scheme, the
cipher watermark image is embedded in the cipher original image in the DCT and SVD domains.

The rest of this paper is structured as follows. We first introduce the basic background
and mathematical theory in Sect. 2. Next, the proposed scheme is presented in
Sect. 3. Then, performance evaluations of the proposed system are presented in Sect. 4.
Finally, a conclusion is drawn in Sect. 5.


2 Preliminaries 


In this section, we describe the chaotic logistic map, which is
used to encrypt a plain image in the proposed scheme. Subsequently, we outline the
discrete cosine transform and the singular value decomposition.


2.1 Chaotic Logistic Map 


A chaotic logistic map is a mathematical technique that deals with a nonlinear dynamic
system. It is a one-dimensional chaotic map that has been widely used in recent years
to protect image content [4, 21]. The logistic map is defined by Eq. (1):


$$x_i = \mu\, x_{i-1}\,(1 - x_{i-1}), \quad i = 1, 2, \ldots \tag{1}$$


where $3.57 < \mu < 4$ is the control parameter and $0 < x_0 < 1$ is the initial value. When
these conditions are met, the generated sequence $x_1, x_2, \ldots, x_n$ is neither
periodic nor convergent. In addition, the map is very sensitive to initial conditions:
even if the difference between two initial values is very small, the iteration results
diverge widely.
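
The sensitivity described above can be illustrated with a minimal sketch. The values mu = 3.99, x0 = 0.1 and the perturbation 1e-10 are illustrative assumptions chosen inside the stated parameter range; they are not taken from the experiments of this paper.

```python
def logistic_sequence(x0, mu, n):
    """Return the iterates x_1..x_n of x_i = mu * x_{i-1} * (1 - x_{i-1})."""
    xs = []
    x = x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        xs.append(x)
    return xs

# Two trajectories whose initial values differ by an assumed perturbation of 1e-10.
trajectory_a = logistic_sequence(0.1, 3.99, 200)
trajectory_b = logistic_sequence(0.1 + 1e-10, 3.99, 200)

# Gap after the first iteration and the largest gap over the last 100 iterations.
early_gap = abs(trajectory_a[0] - trajectory_b[0])
late_gap = max(abs(a - b) for a, b in zip(trajectory_a[100:], trajectory_b[100:]))
print(early_gap, late_gap)
```

The early gap stays at the scale of the perturbation, while the late gap grows to the scale of the values themselves, which is the property the scheme relies on for key sensitivity.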


2.2 Discrete Cosine Transform (DCT) and Quantized DCT (QDCT) 


DCT is a technique that transforms the representation of a signal from the
spatial domain to the frequency domain. In the spatial domain, a digital image is considered
a two-dimensional (2D) matrix in which the gray-scale values of pixels
range from 0 to 255. The two-dimensional DCT (2D-DCT) is adopted to convert
images [22]. The 2D-DCT is defined as follows:


$$F(u, v) = \frac{2}{N}\, c(u)\, c(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N} \tag{2}$$


where $N$ refers to the image block size,

$$c(u),\ c(v) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } u = 0,\ v = 0; \\ 1, & \text{otherwise,} \end{cases}$$

$F(u, v)$ is the DCT coefficient and $f(x, y)$ is the pixel value.
The inverse transform (IDCT) is defined as follows:


$$f(x, y) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, F(u, v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N} \tag{3}$$


Quantization is the step in which information that is not visually significant is
discarded. Every coefficient in the 8 × 8 block is divided by the corresponding
quantization value and rounded. The quantized DCT (QDCT) coefficient is calculated
by Eq. (4):


$$F_q(u, v) = \mathrm{round}\left(\frac{F(u, v)}{Q(u, v)}\right) \tag{4}$$


where $Q(\cdot)$ is a quantization table.
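
As a hedged illustration of the forward transform, the inverse transform and the quantization step, the following sketch builds the orthonormal DCT basis with NumPy. The flat quantization table with value 16 is a hypothetical placeholder, since the table used in the paper is not given here, and the function names are chosen for this example only.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT basis: C[u, x] = sqrt(2/n) * c(u) * cos((2x+1) u pi / (2n))."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * u * np.pi / (2 * n))
    c[0, :] /= np.sqrt(2.0)  # c(0) = 1/sqrt(2)
    return c

def dct2(block):
    """2D-DCT of a square block, computed as C f C^T."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def idct2(coeffs):
    """Inverse 2D-DCT, computed as C^T F C (C is orthogonal)."""
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c

def quantize(coeffs, q_table):
    """Divide each coefficient by its quantization value and round."""
    return np.round(coeffs / q_table)

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)
coeffs = dct2(block)
restored = idct2(coeffs)
q_table = np.full((8, 8), 16.0)  # hypothetical flat table, for illustration only
q_coeffs = quantize(coeffs, q_table)
```

Without quantization the round trip is lossless up to floating-point error; the rounding in `quantize` is where information is discarded.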


2.3 Singular Value Decomposition (SVD) 


SVD is a linear algebra technique. A matrix $A$ is decomposed as the product of a column-orthogonal
matrix $U$, a diagonal matrix $S$ with $r$ (the rank of $A$) nonzero elements, called the
singular values of $A$, and the transpose of an orthogonal matrix $V$:


$$A = U \cdot S \cdot V^T \tag{5}$$


where $U \cdot U^T = I$ and $V \cdot V^T = I$.

SVD is a useful tool for extracting geometric features from an image. Thus, the basic
idea behind SVD-based watermarking is to compute the singular values of
the image and embed the watermark in them. In this way, the watermarking scheme
can withstand geometric attacks.


3 Proposed Scheme 


3.1 Watermarking Embedding 


The watermark embedding process consists of three stages. Stereo image encoding
is performed in the first stage, block matching in the second stage, and
watermark embedding in the third stage.


Stage 1: Stereo Image Encoding. A cryptosystem normally includes two mutually
independent stages: confusion and diffusion. In this paper, we only address confusion,
in which all the image pixels are scrambled without changing their values.

Assume that $P_L$ and $P_R$ are the left and right plain images of a stereo pair, each of
size m-by-n. Each image is reshaped to a 1-by-(m × n) vector. To break the relationship between
adjacent pixels, a random number sequence is generated using the chaotic logistic map
given by Eq. (1) with the two initial conditions $x_0 = 0.1$ and $\mu = 3.75$. The
sequence $(x_1, x_2, \ldots, x_{m \times n})$ determines the permutation of the image pixels. The
plain images $P_L$ and $P_R$ are scrambled with this sequence to obtain the
cipher images $S_L$ and $S_R$, respectively.
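
The confusion stage can be sketched as follows. The text does not state how the chaotic sequence is turned into a pixel permutation, so sorting the sequence and using the resulting index order (`argsort`) is an assumption made for this example, as are the function names.

```python
import numpy as np

def logistic_sequence(x0, mu, n):
    """Iterates x_1..x_n of the logistic map as a NumPy array."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x = mu * x * (1.0 - x)
        xs[i] = x
    return xs

def scramble(image, x0, mu):
    """Permute pixel positions (values unchanged) with a key-dependent permutation."""
    flat = image.reshape(-1)
    # Assumption: the permutation is the index order that sorts the chaotic sequence.
    perm = np.argsort(logistic_sequence(x0, mu, flat.size), kind="stable")
    return flat[perm].reshape(image.shape)

def unscramble(cipher, x0, mu):
    """Invert scramble() when given the same key (x0, mu)."""
    flat = cipher.reshape(-1)
    perm = np.argsort(logistic_sequence(x0, mu, flat.size), kind="stable")
    plain = np.empty_like(flat)
    plain[perm] = flat
    return plain.reshape(cipher.shape)

plain = np.arange(64, dtype=np.uint8).reshape(8, 8)   # toy 8 x 8 "image"
cipher = scramble(plain, 0.1, 3.75)                    # key values from the text
recovered = unscramble(cipher, 0.1, 3.75)
```

Because only positions change, the histogram of the cipher image equals that of the plain image, which is why the text speaks of confusion without diffusion.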


Stage 2: Block Matching. The left and right views of a stereo image are similar.
To improve the performance of the stereo image watermarking scheme,
the inter-view correlation between the two views is exploited by block matching. The process
consists of the following steps:


Step 1: Decompose $S_L$ and $S_R$ (of size m × n) into non-overlapping blocks $B^L_{i,j}$ and
$B^R_{i,j}$ of 8 × 8 pixels, with i ∈ [0, (m/8)-1] and j ∈ [0, (n/8)-1].

Step 2: Transform the image blocks using the 2D-DCT to obtain $DB^L_{i,j}$ and $DB^R_{i,j}$. After
the forward DCT, the element in the upper-left corner is the DC coefficient and
the remaining elements are AC coefficients.

Step 3: Quantize the DCT blocks to obtain $QDB^L_{i,j}$ and $QDB^R_{i,j}$.

Step 4: Calculate the difference value between the left and right blocks over the
searching area. Figure 1 displays the searching area, which was introduced in [23]. The
difference is defined as follows:


Fig. 1. Searching area and anti-diagonal area in a block (triangle shape: searching area; circle
shape: anti-diagonal area)


$$Dif_{i+u,\, j+v} = \sum_{x+y<2} \left[ QDB^L_{i,j}(x, y) - QDB^R_{i+u,\, j+v}(x, y) \right] \tag{6}$$


where $u, v \in (-k, k)$, with $1 < k < 5$, index the neighboring blocks and $QDB(\cdot)$ denotes the
quantized DCT coefficients. Two blocks are considered the most similar block pair when the smallest
difference value is obtained and it satisfies $\min(Dif_{i+u,\, j+v}) < T$, where $T$ denotes the
similarity threshold.
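
A minimal sketch of the block search in Step 4 follows. Several points are assumptions made for illustration: the search area is read as the coefficients with x + y < 2, the per-coefficient difference is taken as an absolute value so that differences cannot cancel, and the array layout, function names and default threshold are hypothetical.

```python
import numpy as np

def low_freq_mask(size=8, bound=2):
    """Boolean mask of the triangular search area x + y < bound (assumed reading)."""
    x, y = np.indices((size, size))
    return (x + y) < bound

def block_difference(left_block, right_block, mask):
    """Sum of absolute differences of quantized DCT coefficients inside the mask."""
    return float(np.abs(left_block[mask] - right_block[mask]).sum())

def best_match(qdb_left, qdb_right, i, j, k=2, threshold=5.0):
    """Find the right-view block most similar to left block (i, j).

    qdb_left, qdb_right: arrays of shape (rows, cols, 8, 8) with quantized DCT blocks.
    Returns (row, col, difference) or None if no neighbor is below the threshold.
    """
    rows, cols = qdb_right.shape[:2]
    mask = low_freq_mask(qdb_left.shape[2])
    best = None
    for u in range(-k + 1, k):          # integers in the open interval (-k, k)
        for v in range(-k + 1, k):
            r, c = i + u, j + v
            if 0 <= r < rows and 0 <= c < cols:
                d = block_difference(qdb_left[i, j], qdb_right[r, c], mask)
                if best is None or d < best[2]:
                    best = (r, c, d)
    if best is not None and best[2] < threshold:
        return best
    return None

# Toy data: every left block is constant, and the right view is shifted by one block.
left = np.zeros((4, 4, 8, 8))
for bi in range(4):
    for bj in range(4):
        left[bi, bj] = 10 * bi + bj
right = np.roll(left, 1, axis=1)
match = best_match(left, right, 1, 1)
```

In this toy setup the block at (1, 1) in the left view reappears at (1, 2) in the right view, so the search returns that position with difference zero.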


Stage 3: Watermark Embedding. To improve the security of the proposed system, the
binary watermark image is scrambled using a chaotic logistic map before it is inserted into the
cover image. Moreover, to enhance the robustness of the scheme, SVD is executed on the
matrix $A$, which is formed by the two anti-diagonals of a matched block pair. The singular
values $S$ are modified to embed the scrambled watermark image $W$ as follows:


$$S_w = S + \alpha \cdot W \tag{7}$$


where $\alpha$ is the robustness factor of the system. Then SVD is performed on $S_w$ to obtain
$S_1$ such that


$$SVD(S_w) = U_1 \cdot S_1 \cdot V_1^T \tag{8}$$


Next, the matrix $A'$ is obtained by the reverse transformation on the matrices $U$, $S_1$
and $V$ according to the following equation:


$$A' = U \cdot S_1 \cdot V^T \tag{9}$$


Then, the coefficients of $A'$ are distributed back to the matched block pair and the
IDCT is applied.

Finally, the watermarked stereo image is formed by combining the IDCT
blocks and decoding the result.


3.2 Watermarking Extracting 


Given the watermarked image together with the
original image and the keys, the watermark image can be extracted. The process includes two
stages, described as follows:

Stage 1: This stage is implemented in the same way as the first and second stages of the
watermark embedding process. The result of this stage is the matrix $A'$.

Stage 2: The watermark is extracted by the following steps:

Step 1: Apply SVD to $A'$ to decompose it into three matrices $U'$, $S_1'$ and $V'$:


$$SVD(A') = U' \cdot S_1' \cdot V'^T \tag{10}$$

Step 2: Calculate $S_w'$ from $U_1$, $V_1$ and $S_1'$ according to the following equation:

$$S_w' = U_1 \cdot S_1' \cdot V_1^T \tag{11}$$


Step 3: Extract the watermark $W'$ by inverting Eq. (7):


$$W' = \frac{S_w' - S}{\alpha} \tag{12}$$
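
The embedding and extraction steps of Eqs. (7) to (12) can be sketched with NumPy as below. This is an illustration under stated assumptions, not the implementation used in the experiments: the host matrix is a random square matrix rather than one built from anti-diagonals of matched blocks, alpha = 0.1 is an arbitrary choice, and the extraction formula is taken as the inverse of Eq. (7).

```python
import numpy as np

def embed(a, w, alpha):
    """Embed watermark w into host matrix a; returns the marked matrix and side information."""
    u, s, vt = np.linalg.svd(a)
    s_mat = np.diag(s)                       # S in Eq. (5)
    s_w = s_mat + alpha * w                  # Eq. (7)
    u1, s1, v1t = np.linalg.svd(s_w)         # Eq. (8)
    a_marked = u @ np.diag(s1) @ vt          # Eq. (9)
    return a_marked, (u1, v1t, s_mat)

def extract(a_marked, side_info, alpha):
    """Recover the watermark from a (possibly attacked) marked matrix."""
    u1, v1t, s_mat = side_info
    s1_prime = np.linalg.svd(a_marked, compute_uv=False)   # Eq. (10), singular values only
    s_w_prime = u1 @ np.diag(s1_prime) @ v1t                # Eq. (11)
    return (s_w_prime - s_mat) / alpha                      # inverse of Eq. (7)

rng = np.random.default_rng(0)
host = rng.random((8, 8)) * 255.0                        # hypothetical host matrix
watermark = rng.integers(0, 2, size=(8, 8)).astype(float)
alpha = 0.1                                              # assumed robustness factor
marked, side_info = embed(host, watermark, alpha)
recovered = extract(marked, side_info, alpha)
recovered_bits = (recovered > 0.5).astype(float)
```

Without any attack, the singular values of the marked matrix equal $S_1$, so the recovered watermark matches the embedded one up to floating-point error.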


4 Performance Evaluations of the Proposed System


In most watermarking algorithms, researchers normally consider two kinds of attacks
on the watermark. The first kind is signal processing, such as filtering, additive noise and
compression; the other is geometric transformation, such as cropping, rotation and
similar operations. However, an efficient watermarking scheme should not only tolerate
signal processing and geometric attacks but also resist statistical and brute-force
attacks. Our watermarking system offers an acceptable trade-off among these attacks. In
this section, we discuss the security analysis and the robustness of the proposed scheme.
The security analysis includes key sensitivity analysis and adjacent pixel correlation analysis.


4.1 Sensitive Key Analysis 


It is known that the chaotic logistic map is highly sensitive to its initial conditions: even
a minute change produces a completely different sequence, which helps an image encryption
scheme withstand brute-force attacks [24]. To evaluate the key sensitivity, we first scramble the plain
image with the two initial values $x_0$ and $\mu$ of Eq. (1) set to 0.7589 and 3.67, respectively.
Subsequently, we perturb the initial value $x_0$ by adding a very small quantity to it. After that, we
rearrange the cipher image with the perturbed key. Figure 2 displays the results of the key
sensitivity analysis.


Fig. 2. The results of the key sensitivity analysis: (a) left view of stereo image Art, (b) left cipher
image, (c) correct decoding, (d) decoding with the perturbed $x_0$, (e) right view of stereo image Art,
(f) right cipher image, (g) correct decoding, (h) decoding with the perturbed $x_0$


4.2 Adjacent Pixels Correlation Analysis 


It is known that each pixel in an image is usually highly correlated with its adjacent pixels,
which can be exploited by attackers to find the relationship between the plain image
and the cipher image. Thus, to test the security of the system, a pixel correlation
analysis of the plain image and the cipher image is conducted. We randomly
select 2048 pairs of adjacent pixels in the horizontal, vertical and diagonal directions from
the plain image as well as the cipher image. The correlation of adjacent pixels can be
analyzed visually by plotting. Figure 3 displays the correlation distributions of the red
component of the stereo image ‘Art’ before and after encoding.


Fig. 3. Correlation visualization for 2048 randomly selected adjacent pixel pairs in plain and cipher
images. Plain image: (a) horizontal, (b) vertical, (c) diagonal. Cipher image: (d) horizontal,
(e) vertical, (f) diagonal


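Moreover, the correlation coefficient of adjacent
pixels is calculated as [25]:


$$\rho_{xy} = \frac{cov(x, y)}{\sqrt{D(x)\, D(y)}} \tag{13}$$

where:

$$cov(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - E(x))(y_i - E(y)) \tag{14}$$

$$D(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - E(x))^2 \tag{15}$$

$$D(y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - E(y))^2 \tag{16}$$

$$E(x) = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{17}$$

$$E(y) = \frac{1}{N} \sum_{i=1}^{N} y_i \tag{18}$$


where $x_i$ and $y_i$ are the values of two adjacent pixels and $N$ is the number of selected pixel pairs.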

Figure 3 and Table 1 show that the correlation coefficients of adjacent pixels in
the plain and cipher images are far apart. In other words, the system successfully transforms
the high correlation coefficients of the plain image into very low correlation coefficients in the
cipher image.
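
The correlation coefficient of Eqs. (13) to (18) can be computed as in the sketch below. The sampling procedure (uniform random positions with a fixed seed) and the two synthetic test images are assumptions for illustration; they are not the Middlebury images used in the paper.

```python
import numpy as np

OFFSETS = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}

def adjacent_correlation(channel, direction, n_pairs=2048, seed=0):
    """Correlation coefficient of n_pairs randomly chosen adjacent pixel pairs."""
    dr, dc = OFFSETS[direction]
    rows, cols = channel.shape
    rng = np.random.default_rng(seed)
    r = rng.integers(0, rows - dr, size=n_pairs)
    c = rng.integers(0, cols - dc, size=n_pairs)
    x = channel[r, c].astype(float)
    y = channel[r + dr, c + dc].astype(float)
    ex, ey = x.mean(), y.mean()                       # Eqs. (17), (18)
    cov = ((x - ex) * (y - ey)).mean()                # Eq. (14)
    dx = ((x - ex) ** 2).mean()                       # Eq. (15)
    dy = ((y - ey) ** 2).mean()                       # Eq. (16)
    return cov / np.sqrt(dx * dy)                     # Eq. (13)

# Synthetic stand-ins: a smooth gradient (plain-like) and uniform noise (cipher-like).
smooth = np.add.outer(np.arange(64), np.arange(64)).astype(float)
noise = np.random.default_rng(1).integers(0, 256, size=(64, 64))
smooth_corr = adjacent_correlation(smooth, "horizontal")
noise_corr = adjacent_correlation(noise, "horizontal")
```

A smooth image gives a coefficient close to 1, whereas an image with no spatial structure gives a coefficient close to 0, which mirrors the plain versus cipher contrast reported in Table 1.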

Table 1. Correlation coefficients of adjacent pixels (Art stereo image)


View       | Component | Plain image                      | Cipher image
           |           | Horizontal | Vertical | Diagonal | Horizontal | Vertical | Diagonal
Left view  | Red       | 0.9694     | 0.9648   | 0.9461   |  0.0298    |  0.0467  |  0.0361
Left view  | Green     | 0.9769     | 0.9834   | 0.9635   |  0.0088    | -0.0034  |  0.0461
Left view  | Blue      | 0.9738     | 0.9827   | 0.9613   |  0.0244    | -0.0430  | -0.0069
Right view | Red       | 0.9746     | 0.9727   | 0.9533   |  0.0397    |  0.0480  |  0.0173
Right view | Green     | 0.9704     | 0.9811   | 0.9589   |  0.0111    | -0.0164  | -0.0168
Right view | Blue      | 0.9741     | 0.9834   | 0.9619   |  0.0245    |  0.0129  | -0.0119


Fig. 4. Logo watermark and left view of test stereo images: (a) logo watermark, (b) left view of
Laundry, (c) left view of Art


4.3 Imperceptibility and Robustness Evaluation of the Watermarking System 


To evaluate the watermark imperceptibility and robustness of the system, a logo watermark
of size 128 × 128 and the stereo image dataset from Middlebury [26] are used. Figure 4
displays the logo watermark and some of the test stereo images.

Table 2 gives the values of the peak signal-to-noise ratio (PSNR) and the structural similarity
(SSIM) index for the test images. The PSNR of the watermarked stereo images is
about 43 dB for Zhou et al.’s scheme [16], while the PSNR of the watermarked stereo
images of the proposed scheme is larger than 56 dB, which satisfies the requirement of
watermark transparency. Watermark transparency is also evaluated by the SSIM index
[27]. Table 2 shows that the SSIM of the proposed method reaches 0.999, which likewise
indicates good watermark transparency.

The bit correct ratio (BCR) is normally used to evaluate the robustness of a watermarking
method [20]. Table 3 displays the PSNR of the watermarked image after each attack
and the BCR of the restored watermark. Even though the PSNR of the attacked watermarked
image is low (about 18 to 20 dB), the BCR of the restored watermark remains at 0.86 or above.
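
For reference, the two metrics can be computed as in the following sketch. PSNR uses the standard definition for 8-bit images; BCR is assumed here to be the fraction of watermark bits that are recovered correctly, and the small arrays are made-up examples.

```python
import numpy as np

def psnr(original, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((original.astype(float) - distorted.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def bit_correct_ratio(w_original, w_extracted):
    """Fraction of binary watermark bits that match (assumed definition of BCR)."""
    return float(np.mean(w_original.astype(bool) == w_extracted.astype(bool)))

image = np.zeros((4, 4), dtype=np.uint8)
shifted = np.ones((4, 4), dtype=np.uint8)          # every pixel off by one gray level
example_psnr = psnr(image, shifted)                 # MSE is 1, so 10 * log10(255^2)

w_true = np.array([1, 0, 1, 1])
w_found = np.array([1, 1, 1, 0])
example_bcr = bit_correct_ratio(w_true, w_found)    # two of four bits agree
```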

Table 2. PSNR (dB) / SSIM of watermarked images


Stereo image | Zhou et al.’s scheme [16]  | Proposed scheme
             | Left view   | Right view   | Left view   | Right view
Laundry      | 42.80/0.978 | 42.79/0.977  | 56.99/0.999 | 59.39/0.999
Art          | 43.20/0.987 | 43.10/0.980  | 57.24/0.999 | 59.85/0.999


Table 3. Watermark robustness to common image processing operations


Attack             | PSNR (dB) | BCR
Salt & Pepper 0.05 | 18.35     | 0.88
Cropping 6.25%     | 20.06     | 0.86
Gaussian 0.05      | 19.14     | 0.88


5 Conclusion 


This paper presents a secure and robust watermarking scheme which enhances the
security of the system by protecting the watermark image and the embedding positions. The
performance evaluations demonstrate that the proposed scheme achieves good imperceptibility
in each view in comparison with a similar state-of-the-art algorithm. The
PSNR and SSIM metrics confirm the better performance of the proposed watermarking
scheme.


Abstract. To raise the quality of higher order mutation testing, in this
paper we propose an approach for improving the effectiveness of multi-objective optimization
algorithms used in higher order mutation testing, in
order to reduce the number of generated mutants, generate hard-to-kill mutants
and construct high-quality higher order mutants. We have performed an empirical
evaluation with 20 real-world, open-source projects and 10 multi-objective
optimization algorithms (5 original algorithms and their 5 corresponding
modified algorithms) to evaluate the experimental results and to draw
some conclusions on how to apply multi-objective optimization algorithms effectively to
higher order mutation testing. The study results indicate that our approach is an
effective way to improve the quality of higher order mutation testing.


Keywords: Mutation testing · Higher order mutation testing · Quality of higher
order mutation testing · Multi-objective optimization algorithms · Mutant
reduction · Quality mutants


1 Introduction 


According to the IEEE definition (IEEE Standard Glossary of Software Engineering Terminology),
“software testing is the process of analyzing a software item to detect differences
between existing and required conditions and to evaluate the features of the software
items”. In other words, to test software, we execute it using a designed set of test
cases, including a given set of test data, to satisfy two distinct goals. The first is to
demonstrate whether the designed and developed software meets all of the customer requirements.
The second is to find any situation in which the behaviour of the software is incorrect,
undesirable, or does not conform to its specification.

Mutation Testing (MT), a technique derived from two basic ideas, the
Competent Programmer Hypothesis (“programmers write programs that are reasonably
close to the desired program”) and the Coupling Effect Hypothesis (“detecting simple faults
will lead to the detection of more complex faults”), was originally proposed in the 1970s
[1, 2]. The purpose of mutation testing is to assess the quality of test cases (TCs), which
in turn supports testers in creating a good set of TCs.
A set of TCs is called good if it is able to expose all of the potential defects
of the program under test (PUT). In MT, mutants are modified versions of the original PUT
obtained by replacing one operator with another. A mutant is called “killed” when the outputs
of the mutant and the PUT differ for the same given set of test cases; otherwise the
mutant is called “alive” [1, 2]. According to [3, 4], the three big problems of MT are: (1)
too many mutants are generated, many of them unnecessary; (2) the mutants are too easy to
kill; and (3) the mutants do not represent actual faults.
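
The notions of a mutant being “killed” or “alive” can be made concrete with a small sketch. The function `is_adult`, its mutant and the test inputs are hypothetical examples invented for illustration; they are not taken from the projects studied in this paper.

```python
def is_adult(age):
    """Original program under test (hypothetical)."""
    return age >= 18

def is_adult_mutant(age):
    """First order mutant: the operator '>=' is replaced by '>'."""
    return age > 18

def is_killed(original, mutant, test_inputs):
    """A mutant is killed if at least one test input yields a different output."""
    return any(original(t) != mutant(t) for t in test_inputs)

weak_tests = [10, 30]          # misses the boundary value
strong_tests = [10, 18, 30]    # includes the boundary value 18

alive_with_weak = not is_killed(is_adult, is_adult_mutant, weak_tests)
killed_with_strong = is_killed(is_adult, is_adult_mutant, strong_tests)
```

The mutant stays alive under the weak test set and is killed once the boundary value is tested, which is how mutation testing exposes gaps in a test suite.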

Mutant reduction, generating hard-to-kill mutants and constructing quality
mutants are the issues that need to be considered when using higher order mutation testing
(HOMT). In [5-8, 10-21], multi-objective optimization algorithms, which were
devised for solving optimization problems and making decisions that satisfy multiple
objectives, were used as an effective approach to overcome the three
big problems mentioned above.

Higher order mutation testing (which includes second order mutation testing) [5-8, 10-22]
has been considered a promising solution for overcoming the limitations of
traditional (first order) mutation testing. However, HOMT
is not yet widely used in software testing practice because of its quality problem,
which needs further study. Improving this quality is the aim
of our research in this paper.

The next section introduces our proposed approach and summarizes the related
work. Section 3 presents our empirical study in detail. Section 4 reports and analyzes
the results of the experiment. Section 5 gives the conclusions and future
work.


2 Proposed Approach and Related Works 


As we presented in our previous work [21], McConnell concluded [29] that “there
are about 1-25 errors per 1000 lines of code for delivered software and about 10-20
defects per 1000 lines of code during in-house testing and 0.5 defects per 1000 lines of
code in released product”. In our understanding, this means that in complete versions of
software projects written by good, experienced programmers,
a line of code rarely has more than one defect. Derived from that statement, and with the
aim of reducing the number of generated mutants which do not represent actual faults, we
introduce an approach that modifies multi-objective optimization algorithms
so that they construct a better set of mutants. We do not randomly combine First
Order Mutants (FOMs) to create Higher Order Mutants (HOMs). Instead, following the rule
“apply no more than one mutation operator to each line of code”, we create an initial list of HOMs by
combining two or more FOMs whose mutation operators are at different lines of code.

In [21], we modified the eNSGAII algorithm (the modified algorithm is named
eNSGAII-DiffLOC) according to the above-mentioned rule, because it was the best of all the
algorithms that we had used to construct “high quality and reasonable mutants” (named
H7). H7 [17, 18] is a higher order mutant (HOM) which is harder to kill than its
constituent first order mutants (FOMs). Moreover, a set of test cases that can kill an H7
can also kill all constituent FOMs of that H7. Our results indicate that “The
eNSGAII-DiffLOC seems to be slightly better than original eNSGAII algorithm in terms
of mutant reduction, generating harder-to-kill mutant and constructing H7” [21].
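
The rule “apply no more than one mutation operator to each line of code” can be sketched as a filter over combinations of FOMs. This is only an illustration of the rule: the representation of a FOM as a (line number, operator label) pair, the operator labels and the exhaustive enumeration are assumptions, whereas in the paper the rule guides search-based algorithms inside Judy.

```python
from itertools import combinations

def build_homs(foms, order=2):
    """Combine FOMs into HOMs of the given order, keeping only combinations
    whose mutations are located on pairwise different lines of code."""
    homs = []
    for combo in combinations(foms, order):
        lines = [line for line, _ in combo]
        if len(set(lines)) == len(lines):   # no two mutations share a line
            homs.append(combo)
    return homs

# Hypothetical FOMs: (line number, mutation operator label).
foms = [(3, "AOR"), (3, "ROR"), (7, "ROR"), (9, "LCR")]
second_order = build_homs(foms, order=2)
third_order = build_homs(foms, order=3)
```

With these four FOMs, the only second order combination that is rejected is the one that mutates line 3 twice.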

The results in [21] were evaluated by comparing the following values:


— NoM: the ratio of the number of HOMs to the number of FOMs.

— NoT: the ratio of the number of TCs which kill HOMs to the number of TCs which
kill the constituent first order mutants of those HOMs.

— NoR: the ratio of generated “reasonable higher order mutants” [17, 18] to all HOMs.

— NoH7: the ratio of H7 [17, 18] to all HOMs.

— NoH1: the ratio of H1 (“live (potentially equivalent) higher order mutants”) [17,
18] to all HOMs.


In this paper, we modify 5 multi-objective optimization algorithms and
perform an empirical study that applies 10 algorithms (5 original multi-objective
optimization algorithms and their 5 corresponding modified algorithms) to HOMT with
20 real-world, open-source projects. Then, we evaluate the results based on the 5 ratios
NoM, NoT, NoR, NoH7 and NoH1.

The multi-objective optimization algorithms are NSGA-II, eNSGA-II, NSGA-III
[24, 25], eMOEA [26, 27] and SPEA-II [28], and the corresponding modified
algorithms are named NSGAII-DiffLOC, eNSGAII-DiffLOC, NSGAIII-DiffLOC,
eMOEA-DiffLOC and SPEAII-DiffLOC, respectively.

We also use the objective and fitness functions that we proposed and
effectively applied in our previous works [17-22].


3 Supporting Tool and PUTs 


Judy (http://www.mutationtesting.org/) is a mutation testing tool
developed in Java and for Java by Madeyski et al. [23]. This tool supports a
large set of Java mutation operators and the main steps of mutation testing:
mutant generation, mutant execution and mutation analysis. In addition, Judy has many built-in
multi-objective optimization algorithms [17-22].

These are the main reasons why, in this paper, we continue to extend Judy
and use it as the supporting tool in our experiment.

All 20 PUTs were downloaded from https://github.com and
http://sourceforge.net. They are open source projects that include a given set of TCs,
and each PUT passes its included set of TCs.

Table 1 summarizes the 20 selected PUTs: the name of each PUT (PUTs, in
column 1), its lines of code (LOC, in column 2) and its number of given test cases (#TCs, in
column 3).


4 Results Analysis 


Experimental results are shown in Table 2 and Fig. 1. Table 2 reports the mean values
of the ratios NoM, NoT, NoR, NoH7 and NoH1 (see Sect. 2) for each selected PUT:
Antomology; BeanBin; Commons-csv; Commons-cli-1.3.1-src; Jettison;
CommonsFileUploads 1.3.1; Commons-email; CommonsChain 1.2; JWBF; Commons-dbutils;
Commons-chain-1.2-src; Java-util; Barbecue; Commons-text; CommonsValidator 1.4.1;
Commons-validator; CommonsJxPath 1.3; Commons-digester3-3.2-src; CommonsLang3 3.0;
Commons-lang3-3.4-src, numbered from 1 to 20 respectively in column 1. These mean
values are averages over the original multi-objective optimization algorithms (in the rows
named Original) and over the modified algorithms (in the rows named DiffLOC) for each PUT.

Table 1. Projects under test (PUTs)


No | PUTs                      | LOC    | #TCs
1  | Antomology                | 1073   | 35
2  | BeanBin                   | 5925   | 68
3  | Commons-csv               | 9972   | 292
4  | Commons-cli-1.3.1-src     | 11375  | 247
5  | Jettison                  | 11690  | 116
6  | CommonsFileUploads 1.3.1  | 12321  | 12
7  | Commons-email             | 12495  | 175
8  | CommonsChain 1.2          | 13410  | 17
9  | JWBF                      | 13572  | 305
10 | Commons-dbutils           | 15022  | 228
11 | Commons-chain-1.2-src     | 17702  | 105
12 | Java-util                 | 18690  | 379
13 | Barbecue                  | 23996  | 190
14 | Commons-text              | 25298  | 455
15 | CommonsValidator 1.4.1    | 25422  | 66
16 | Commons-validator         | 32743  | 350
17 | CommonsJxPath 1.3         | 41079  | 28
18 | Commons-digester3-3.2-src | 41986  | 175
19 | CommonsLang3 3.0          | 122964 | 126
20 | Commons-lang3-3.4-src     | 124151 | 2587

In column 2 of Table 2, the term “Original” represents the 5 original multi-objective
optimization algorithms (NSGA-II, eNSGA-II, NSGA-III, eMOEA and
SPEA-II), while the term “DiffLOC” represents the 5 corresponding modified algorithms
(NSGAII-DiffLOC, eNSGAII-DiffLOC, NSGAIII-DiffLOC, eMOEA-DiffLOC
and SPEAII-DiffLOC).

Figure 1 illustrates the comparison between the values of the above-mentioned ratios
(NoM, NoT, NoR, NoH7, NoH1) of HOMT executions applying the original algorithms and
the modified algorithms.

Table 2. The mean ratios of NoM, NoT, NoR, NoH7, NoH1 (%)


PUT | Algorithm | NoM   | NoT   | NoR   | NoH7 | NoH1
1   | Original  | 16.22 | 64.33 | 57.74 | 6.43 | 28.67
    | DiffLOC   | 14.47 | 61.21 | 69.02 | 6.98 | 30.02
2   | Original  | 17.16 | 63.48 | 56.78 | 6.12 | 30.34
    | DiffLOC   | 15.34 | 60.12 | 58.04 | 6.45 | 33.67
3   | Original  | 16.76 | 62.17 | 58.92 | 7.02 | 30.21
    | DiffLOC   | 13.99 | 59.89 | 59.75 | 7.38 | 33.42
4   | Original  | 15.79 | 65.74 | 57.56 | 6.24 | 31.95
    | DiffLOC   | 14.01 | 61.05 | 59.01 | 6.78 | 34.50
5   | Original  | 16.87 | 63.45 | 54.56 | 6.10 | 32.73
    | DiffLOC   | 14.98 | 60.12 | 58.23 | 6.53 | 34.02
6   | Original  | 15.95 | 64.94 | 60.12 | 6.22 | 32.11
    | DiffLOC   | 15.75 | 62.36 | 62.04 | 7.01 | 34.35
7   | Original  | 16.02 | 63.18 | 58.53 | 6.03 | 29.54
    | DiffLOC   | 14.56 | 60.23 | 59.84 | 6.78 | 32.87
8   | Original  | 16.78 | 64.58 | 53.78 | 6.17 | 31.98
    | DiffLOC   | 15.43 | 61.29 | 55.92 | 6.28 | 33.45
9   | Original  | 16.24 | 65.47 | 59.32 | 6.23 | 32.84
    | DiffLOC   | 14.67 | 63.92 | 61.02 | 6.89 | 34.56
10  | Original  | 15.94 | 65.49 | 58.37 | 5.98 | 32.17
    | DiffLOC   | 14.22 | 62.36 | 59.98 | 6.24 | 35.23
11  | Original  | 17.01 | 67.25 | 57.64 | 6.72 | 31.67
    | DiffLOC   | 15.72 | 63.57 | 59.03 | 7.04 | 32.89
12  | Original  | 16.04 | 64.58 | 58.12 | 6.38 | 30.09
    | DiffLOC   | 14.05 | 61.87 | 59.45 | 6.84 | 34.20
13  | Original  | 16.57 | 65.14 | 58.74 | 6.11 | 30.65
    | DiffLOC   | 15.02 | 60.49 | 59.83 | 6.75 | 33.57
14  | Original  | 15.92 | 64.52 | 54.57 | 6.43 | 31.29
    | DiffLOC   | 14.01 | 61.04 | 57.39 | 6.95 | 34.02
15  | Original  | 16.11 | 64.68 | 60.43 | 6.14 | 32.36
    | DiffLOC   | 15.20 | 62.00 | 61.71 | 6.53 | 33.98
16  | Original  | 16.90 | 63.79 | 55.89 | 6.04 | 30.73
    | DiffLOC   | 14.12 | 61.22 | 58.22 | 6.58 | 33.70
17  | Original  | 16.23 | 64.00 | 57.38 | 6.37 | 29.99
    | DiffLOC   | 15.22 | 62.39 | 59.02 | 6.92 | 32.47
18  | Original  | 16.35 | 66.24 | 56.87 | 5.99 | 32.11
    | DiffLOC   | 14.32 | 61.98 | 59.01 | 6.74 | 34.05
19  | Original  | 15.77 | 65.16 | 59.34 | 6.05 | 31.59
    | DiffLOC   | 14.21 | 62.34 | 61.54 | 6.78 | 34.08
20  | Original  | 15.93 | 66.03 | 58.60 | 6.75 | 30.12
    | DiffLOC   | 13.98 | 62.54 | 62.15 | 6.94 | 33.45

Fig. 1. Comparison of the values of NoM, NoT, NoR, NoH7, NoH1


As mentioned before, in this paper we focus on executing higher order mutation
testing with the original multi-objective algorithms (named EXEC1) and the corresponding
modified algorithms (named EXEC2) to evaluate and compare their effect
on mutant reduction, generating hard-to-kill mutants and constructing quality mutants.

The obtained results indicate that the modified algorithms are consistently better
than the original algorithms across all 20 PUTs in Table 2, more specifically as follows:


— The number of generated mutants of EXEC2 is smaller than the number of generated
mutants of EXEC1 (evaluated based on NoM).

— The number of generated hard-to-kill mutants is increased (evaluated based on NoT,
NoR and NoH1). From NoT, we see that the number of TCs which kill the generated
HOMs in EXEC2 is smaller than in EXEC1 (relative to the number of TCs which
kill their constituent FOMs). The values of NoR indicate that the ratio of generated
“reasonable HOMs” to all generated HOMs is higher in EXEC2 than in EXEC1. The
problem of equivalent mutants is a main barrier of mutation testing [3], so a larger
number of H1 (“live (potentially equivalent) mutants”) (NoH1) can be seen as a
disadvantage. However, after careful analysis, we realized that H1 mutants can
be “hard-to-kill mutants” [17-21], which are valuable. Those HOMs cannot be killed
by the given set of test cases included in the selected project under test, but
they could be killed by one or more new and better test cases (better in terms of the ability
to expose the potential defects of computer programs). Therefore, more
specific studies are needed to confirm whether these H1 mutants are really equivalent. Such
studies can take a lot of time and are inevitably error-prone [9].

— The number of H7 (NoH7) in EXEC2 is slightly higher than in EXEC1. This
demonstrates that the modified algorithms construct high quality and reasonable HOMs (H7)
better than the corresponding original multi-objective optimization
algorithms.


5 Conclusions 


The quality of mutation testing in general, and of higher order mutation testing in particular,
is an important problem that needs to be studied before mutation testing can be widely applied in
software testing practice.

In this paper, we presented an empirical study with 20 projects under test, 5
original multi-objective optimization algorithms and their 5 corresponding modified
algorithms, in order to confirm the effectiveness of our proposed
method in the field of higher order mutation testing with multi-objective optimization
algorithms. The obtained results indicate that our method is effective
for improving the quality of mutation testing in terms of mutant reduction,
generating hard-to-kill mutants and constructing high quality, reasonable mutants.

We are aware that the 20 Java projects selected in this paper may not be representative
of Java programs in particular or of programs in other languages in general. In addition, the
quality and the coverage criteria of the 20 test suites included in the 20 PUTs may be
quite different (higher or lower) from those of actual projects. Therefore, it is
hard to say for sure that our proposed solution is an absolutely good one.

However, given the satisfactory initial results presented in [21] and in this
paper, we hope that our method is one of the effective solutions for improving the quality of
applying multi-objective algorithms to higher order mutation testing. Based on our
belief in the usefulness of our study results, we will continue to perform further research in
the future to confirm the correctness and effectiveness of the presented method.


Acknowledgement. This paper is funded by Vietnam National University Ho Chi Minh City 


Abstract. Belief merging has received much attention from the research community,
with a large range of applications in Computer Science and Artificial
Intelligence. In this paper, we present a new belief merging approach for prioritized
belief bases. The main idea of this method is to use two operators, namely the
connective strong operator and the averagely increasing operator, to merge possibilistic
belief bases. In this way, the proposed method keeps more useful
beliefs, which may be lost in other methods because of the drowning effect. The
logical properties of the merging result are also analyzed and discussed.


Keywords: Belief merging · Possibilistic logic · Prioritized belief base


1 Introduction 


Logic-based belief merging methods have attracted a lot of attention in many areas of
computer science. In the logical model, each information source is often treated as a belief
base and represented as a set of logical formulas. Belief merging by argumentation is
often implemented in classical logic and possibilistic logic [1, 2]. In practice, we usually
face contradictions, and in this case classical logic is inapplicable. Possibilistic logic [3]
is one of the most common tools for dealing with contradictions. In
possibilistic logic, each formula is attached to a weight indicating the necessity of the
formula, which is understood as the priority of the formula. We can deduce some non-trivial
results from a partially inconsistent belief base by using a possibilistic consequence
relation.

In the literature, there are some typical approaches for merging possibilistic belief bases,
such as [4-9]. In these works, a merging operator is used to calculate a new possibility
distribution from the possibility distributions of the given belief bases. Then, the syntactic
representation of the new possibility distribution is generated [7]. Unlike propositional merging
operators, possibilistic merging operators can lead to an inconsistent belief base even
though the input belief bases are consistent. Further, these methods do not require that
the input belief bases are consistent. However, they are affected by the drowning effect, which
causes the loss of all information whose weight is smaller than the inconsistency degree of
the input belief bases.
When merging belief bases, we may face the situation in which an agent holds
inconsistent beliefs and accepts them. Many researchers have proposed methods to resolve
the inconsistency problem, typically [10-14]. Some works improve semantic merging rules,
such as [15-17]. In these works, the merging process includes two steps: weakening the
conflicting beliefs and then combining them. However, they are not recommended for
integrating belief bases with strong conflicts, namely, when the inconsistency level of the
belief bases is high. For the argumentation-based approach in [1, 2, 4], it is difficult to
build arguments from an inconsistent belief base.

As far as we know, there is no clear method for merging inconsistent belief bases. The
input belief bases are always assumed to be consistent when merging problems are
considered [10, 11, 15, 17-20]. In some cases, although an input belief base is inconsistent,
it can still preserve some useful information about the real world. Thus, the result
of the merging process may not be required to be a consistent belief base.

In this paper, we propose a new approach to merge prioritized belief bases in possibilistic
logic by choosing an appropriate priority relationship between arguments. Our approach
is defined semantically by the belief merging model in propositional logic. The integrated
result is still a possibility distribution. In order to avoid losing useful information after
merging due to the drowning effect, we also propose a method based on combining
several different operators. Moreover, this approach also allows implicit (potential)
information to be merged. We point out that our approach captures minimal change and
some good logical properties.

The structure of this paper is as follows. Sect. 2 introduces the basic knowledge of
possibilistic logic and belief merging. In Sect. 3, the belief merging model for inconsistent
prioritized belief bases is proposed. The properties of the belief merging operators based
on combining multiple operators are discussed in Sect. 4. Conclusions and future work
are given in Sect. 5.


2 Preliminary 


2.1 Possibilistic Logic 


In this section, we recall the basics of possibilistic logic (more details can be found in [3]).

The semantics of possibilistic logic is based on the concept of a possibility distribution
$\pi$, which is a map from the set of interpretations $\Omega$ to the interval $[0, 1]$. The
possibility degree $\pi(\omega)$ represents the degree of compatibility of the interpretation
$\omega$ with the available beliefs about the real world. A possibility distribution is said to
be standard if there exists $\omega_0 \in \Omega$ such that $\pi(\omega_0) = 1$. From a
possibility distribution $\pi$, two measures can be defined: the possibility degree of a
formula $\varphi$, $\Pi_\pi(\varphi) = \max\{\pi(\omega) : \omega \in \Omega, \omega \models \varphi\}$,
and the necessity degree of $\varphi$, $N_\pi(\varphi) = 1 - \Pi_\pi(\neg\varphi)$.

At the syntactic level, we represent each possibilistic formula by a pair $(\varphi, \alpha)$,
where $\varphi$ is a propositional formula and the weight $\alpha \in [0, 1]$. It states that the
necessity degree of the formula $\varphi$ is greater than or equal to $\alpha$, namely
$N(\varphi) \geq \alpha$. A belief base is a finite set of formulas of the form
$T = \{(\varphi_i, \alpha_i) : i = 1, \ldots, n\}$. The classical belief base associated with $T$ is
denoted $T^*$, namely $T^* = \{\varphi_i \mid (\varphi_i, \alpha_i) \in T\}$. The formulas of a belief
base $T$ can be ordered according to their weights so that
$\alpha_1 = 1 > \alpha_2 > \cdots > \alpha_n > 0$. It is easy to see that a possibilistic belief
base $T$ is consistent if and only if its classical belief base $T^*$ is consistent. A possibilistic
belief profile $E$ is a multi-set of possibilistic belief bases.

372 T. T. L. Le and T. H. Tran

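Given a possibilistic belief base $T$, by the principle of minimum specificity [3],
we obtain a unique possibility distribution, denoted by $\pi_T$, as follows. For all
$\omega \in \Omega$,

$$\pi_T(\omega) = \begin{cases} 1 & \text{if } \forall (\varphi_i, \alpha_i) \in T,\ \omega \models \varphi_i, \\ 1 - \max\{\alpha_i : (\varphi_i, \alpha_i) \in T \text{ and } \omega \not\models \varphi_i\} & \text{otherwise.} \end{cases} \quad (1)$$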
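Eq. (1) can be evaluated by brute force over all interpretations. The following is a minimal sketch, assuming a hypothetical encoding in which each formula is a Python function over a truth assignment; the small base `T` is illustrative and does not come from the paper.

```python
from itertools import product

def possibility_distribution(base, atoms):
    """Return {interpretation: pi_T(interpretation)} following Eq. (1).

    base  : list of (formula, weight); a formula is a function that takes a
            dict {atom: bool} and returns True when it is satisfied
            (hypothetical encoding chosen for this sketch).
    atoms : list of atom names; interpretations are tuples in this order.
    """
    distribution = {}
    for values in product([False, True], repeat=len(atoms)):
        omega = dict(zip(atoms, values))
        # Weights of the formulas falsified by this interpretation.
        falsified = [w for formula, w in base if not formula(omega)]
        distribution[values] = 1.0 if not falsified else 1.0 - max(falsified)
    return distribution

# Illustrative base T = {(p, 0.9), (not p or r, 0.3)}.
T = [
    (lambda w: w["p"], 0.9),
    (lambda w: (not w["p"]) or w["r"], 0.3),
]
pi_T = possibility_distribution(T, ["p", "r"])
for interpretation, degree in sorted(pi_T.items()):
    print(interpretation, round(degree, 3))
```

Interpretations that satisfy every formula receive degree 1; the others are penalised by the strongest formula they falsify.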


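Definition 1. Let $T$ be a possibilistic belief base and $\alpha \in [0, 1]$.

1. The $\alpha$-cut of $T$ is $T_{\geq\alpha} = \{\varphi_i \in T^* \mid (\varphi_i, \beta_i) \in T \text{ and } \beta_i \geq \alpha\}$.
2. The strict $\alpha$-cut of $T$ is $T_{>\alpha} = \{\varphi_i \in T^* \mid (\varphi_i, \beta_i) \in T \text{ and } \beta_i > \alpha\}$.
3. The inconsistency degree of $T$ is $Inc(T) = \max\{\alpha_i : T_{\geq\alpha_i} \text{ is inconsistent}\}$.


The inconsistency degree of $T$ is the largest weight $\alpha_i$ for which the $\alpha_i$-cut of $T$
is inconsistent.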


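Definition 2. Let $T$ be a possibilistic belief base. The formula $\varphi$ is said to be a
consequence of $T$ with degree $\alpha$, denoted by $T \vdash_\pi (\varphi, \alpha)$, if and only if
(i) $T_{\geq\alpha}$ is consistent; (ii) $T_{\geq\alpha} \vdash \varphi$; (iii) $\forall \beta > \alpha$,
$T_{\geq\beta} \nvdash \varphi$.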


According to Definition 2, non-trivial results may still be deduced from an inconsistent
belief base. However, the inference is affected by the drowning effect. That is, with an
inconsistent belief base $T$, all formulas whose certainty degrees are not greater than
$Inc(T)$ are completely useless for non-trivial inference. For example, for
$T = \{(p, 0.9), (\neg p, 0.8), (r, 0.3), (\varphi, 0.5)\}$, obviously $T$ is equivalent to
$T' = \{(p, 0.9), (\neg p, 0.8)\}$ because $Inc(T) = 0.8$. Therefore, $(\varphi, 0.5)$ and
$(r, 0.3)$ are not used in possibilistic reasoning.
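The drowning effect in this example can be checked numerically. The sketch below assumes a brute-force consistency test over truth assignments and a hypothetical `(label, formula, weight)` encoding; it is only meant to illustrate Definition 1 on the base above.

```python
from itertools import product

def is_consistent(formulas, atoms):
    """True if some truth assignment satisfies every formula (brute force)."""
    return any(
        all(f(dict(zip(atoms, values))) for f in formulas)
        for values in product([False, True], repeat=len(atoms))
    )

def inconsistency_degree(base, atoms):
    """Largest weight alpha whose alpha-cut is inconsistent (0 if none)."""
    degree = 0.0
    for alpha in sorted({w for _, _, w in base}):
        cut = [f for _, f, w in base if w >= alpha]
        if not is_consistent(cut, atoms):
            degree = alpha
    return degree

# T = {(p, 0.9), (not p, 0.8), (r, 0.3), (phi, 0.5)} from the text.
T = [
    ("p", lambda w: w["p"], 0.9),
    ("not p", lambda w: not w["p"], 0.8),
    ("r", lambda w: w["r"], 0.3),
    ("phi", lambda w: w["phi"], 0.5),
]
inc = inconsistency_degree(T, ["p", "r", "phi"])
usable = [label for label, _, w in T if w > inc]
drowned = [label for label, _, w in T if w <= inc]
print(inc, usable, drowned)
```

Only formulas strictly above the inconsistency degree remain usable, so `r` and `phi` are drowned even though they are unrelated to the conflict between `p` and `not p`.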

Two possibilistic belief bases $T$ and $T'$ are equivalent, denoted by $T \equiv T'$, if and
only if $\forall \alpha \in (0, 1]$, $T_{\geq\alpha} \equiv T'_{\geq\alpha}$. Two possibilistic belief
profiles $E = \{T_1, T_2, \ldots, T_n\}$ and $E' = \{T'_1, T'_2, \ldots, T'_n\}$ are equivalent,
denoted by $E \equiv E'$, if and only if there exists a permutation $\sigma$ on
$\{1, \ldots, n\}$ such that $T_i \equiv T'_{\sigma(i)}$ for all $i \in \{1, \ldots, n\}$.

Many operators have been proposed for merging possibilistic belief bases. Given a
possibilistic belief profile $\{T_1, T_2, \ldots, T_n\}$ with possibility distributions
$\{\pi_1, \pi_2, \ldots, \pi_n\}$, the semantic combination of the possibility distributions by
an aggregation function $\oplus$ results in a new possibility distribution
$\pi_\oplus(\omega) = (\ldots((\pi_1(\omega) \oplus \pi_2(\omega)) \oplus \pi_3(\omega)) \oplus \ldots) \oplus \pi_n(\omega)$.

The possibilistic belief base that results from the merging process is calculated by
the following equation [7]:

$$T_\oplus = \{(K_j, 1 - \oplus(x_1, \ldots, x_n)) : j = 1, \ldots, n\} \quad (2)$$

where $K_j$ is a disjunction of size $j$ of formulas taken from different $T_i$
($i = 1, \ldots, n$), and $x_i$ is equal to $1 - \alpha_i$ or $1$ depending on whether $\varphi_i$
belongs to $K_j$ or not.
The operator $\oplus$ should satisfy the following properties:


1. $\oplus(0, \ldots, 0) = 0$
2. If $\forall i = 1, \ldots, n$, $\alpha_i \geq \beta_i$, then $\oplus(\alpha_1, \ldots, \alpha_n) \geq \oplus(\beta_1, \ldots, \beta_n)$


2.2 Belief Merging 


Belief merging is a process that integrates the beliefs from multiple belief sources into a
common belief base. Many approaches to belief merging have been proposed, analysed and
discussed in the past thirty years. Because of the lack of space, we do not discuss them
further in this paper. We focus only on the set of axioms that characterise the result of
the merging process. These axioms are stated as follows.

Let $E$, $E_1$, $E_2$ be belief profiles and $\mu$, $\mu_1$, $\mu_2$ be formulas from
$\mathcal{L}_{PS}$. A belief merging operator $\Delta$ is a map from the set of belief profiles
to a propositional belief base. The following axioms should be satisfied:


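(IC0) $\Delta_\mu(E) \vdash \mu$.

(IC1) If $\mu$ is consistent then $\Delta_\mu(E)$ is also consistent.

(IC2) If $\bigwedge E \wedge \mu$ is consistent then $\Delta_\mu(E) \equiv \bigwedge E \wedge \mu$.

(IC3) If $E_1 \equiv E_2$ and $\mu_1 \equiv \mu_2$ then $\Delta_{\mu_1}(E_1) \equiv \Delta_{\mu_2}(E_2)$.

(IC4) If $K_1 \vdash \mu$ and $K_2 \vdash \mu$, then $\Delta_\mu(\{K_1, K_2\}) \wedge K_1$ is consistent
if and only if $\Delta_\mu(\{K_1, K_2\}) \wedge K_2$ is consistent.

(IC5) $\Delta_\mu(E_1) \wedge \Delta_\mu(E_2) \vdash \Delta_\mu(E_1 \sqcup E_2)$.

(IC6) If $\Delta_\mu(E_1) \wedge \Delta_\mu(E_2)$ is consistent then
$\Delta_\mu(E_1 \sqcup E_2) \vdash \Delta_\mu(E_1) \wedge \Delta_\mu(E_2)$.

(IC7) $\Delta_{\mu_1}(E) \wedge \mu_2 \vdash \Delta_{\mu_1 \wedge \mu_2}(E)$.

(IC8) If $\Delta_{\mu_1}(E) \wedge \mu_2$ is consistent then
$\Delta_{\mu_1 \wedge \mu_2}(E) \vdash \Delta_{\mu_1}(E) \wedge \mu_2$.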


3 Belief Merging for Possibilistic Logic 


In [9], the authors proposed a syntax-based approach for merging a set of $n$ consistent
possibilistic belief bases $T_1, \ldots, T_n$. A possibilistic belief merging operator, denoted
by $\oplus$, is a map from $[0, 1]^n$ to $[0, 1]$ that integrates the certainty of a belief from
multiple belief sources. The result of the merging process $T_\oplus$ is as follows:


$$T_\oplus = \{(\varphi, \oplus(\alpha_1, \ldots, \alpha_n)) : T_i \vdash_\pi (\varphi, \alpha_i)\} \quad (3)$$


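This method considers all the formulas in $T_i$ even if $T_i$ is inconsistent. Let
$T_1 = \{(\varphi_i, \alpha_i) : i = 1, \ldots, n\}$ and $T_2 = \{(\theta_j, \beta_j) : j = 1, \ldots, m\}$
be two possibilistic belief bases. The merging result $T_\oplus$ of $T_1$ and $T_2$ according to
(3) is equivalent to the following:


$$T_\oplus = \{(\varphi_i, \oplus(\alpha_i, 0)) : (\varphi_i, \alpha_i) \in T_1 \text{ and } \varphi_i \notin T_2^*\} \cup \{(\theta_j, \oplus(0, \beta_j)) : (\theta_j, \beta_j) \in T_2 \text{ and } \theta_j \notin T_1^*\} \cup \{(\varphi_i \vee \theta_j, \oplus(\alpha_i, \beta_j)) : (\varphi_i, \alpha_i) \in T_1 \text{ and } (\theta_j, \beta_j) \in T_2\} \quad (4)$$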
Let $T_1$ and $T_2$ be two possibilistic belief bases, and let $\pi_\otimes$ be combined from
$\pi_{T_1}$ and $\pi_{T_2}$ based on the belief merging operator $\otimes$. The belief merging
result is as follows:


$$T_1 \otimes T_2 = \{(\varphi_i, 1 - ((1 - \alpha_i) \otimes 1)) : (\varphi_i, \alpha_i) \in T_1\} \cup \{(\theta_j, 1 - (1 \otimes (1 - \beta_j))) : (\theta_j, \beta_j) \in T_2\} \cup \{(\varphi_i \vee \theta_j, \otimes(\alpha_i, \beta_j)) \mid (\varphi_i, \alpha_i) \in T_1 \text{ and } (\theta_j, \beta_j) \in T_2\} \quad (5)$$


When $\otimes = \min$, it is easy to show that $T_1 \otimes T_2 = T_1 \cup T_2$.
Let $T_1$ and $T_2$ be two possibilistic belief bases. If the operator $\oplus$ in (3) is maximum
and the operator $\otimes$ in (5) is minimum, we have:


$$T_\oplus = T_1 \otimes T_2 \quad (6)$$


There are two important groups of possibilistic belief merging operators: the group of
conjunctive operators, which contains minimum, product ($\alpha \times \beta$) and linear
product ($\max(0, \alpha + \beta - 1)$), and the group of disjunctive operators, which contains
maximum, probabilistic sum ($\alpha + \beta - \alpha \times \beta$) and bounded sum
($\min(1, \alpha + \beta)$) [21].


Definition 6. Let $\oplus_1$ and $\oplus_2$ be two belief merging operators satisfying the above
properties. $\oplus_1$ and $\oplus_2$ are dual if and only if
$\oplus_1(\alpha, \beta) = 1 - ((1 - \alpha) \oplus_2 (1 - \beta))$.

The typical dual operators are the conjunctive and disjunctive operators in [22].
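The duality condition can be checked numerically for the operators listed above. The sketch below tests the identity on a finite grid of weights, so it is a sanity check rather than a proof; the function names are chosen for this illustration.

```python
def product_op(a, b):
    return a * b

def linear_product(a, b):
    return max(0.0, a + b - 1.0)

def probabilistic_sum(a, b):
    return a + b - a * b

def bounded_sum(a, b):
    return min(1.0, a + b)

def are_dual(op1, op2, steps=10, tol=1e-9):
    """Check op1(a, b) == 1 - op2(1 - a, 1 - b) on a grid of [0, 1]^2."""
    grid = [i / steps for i in range(steps + 1)]
    return all(
        abs(op1(a, b) - (1.0 - op2(1.0 - a, 1.0 - b))) <= tol
        for a in grid
        for b in grid
    )

print(are_dual(max, min))                        # maximum / minimum
print(are_dual(probabilistic_sum, product_op))   # probabilistic sum / product
print(are_dual(bounded_sum, linear_product))     # bounded sum / linear product
print(are_dual(probabilistic_sum, linear_product))
```

Each disjunctive operator pairs with exactly one conjunctive operator from the list; mismatched pairs such as probabilistic sum and linear product fail the check.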

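For each formula $\varphi$, if $(\varphi, \alpha) \in T_1$ and $(\varphi, \beta) \in T_2$ such that
$\alpha, \beta > 0$, then $(\varphi, \oplus(\alpha, \beta)) \in T_\oplus$. On the other hand,
$\varphi$ appears in $T_1 \otimes T_2$ in several forms, namely $(\varphi, \otimes(\alpha, 0))$,
$(\varphi, \otimes(0, \beta))$ and $(\varphi, \otimes(\alpha, \beta))$. Obviously,
$(\varphi, \otimes(\alpha, 0))$ and $(\varphi, \otimes(0, \beta))$ are redundant, so we can remove
them to simplify the merging result.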


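Example 1. Let $T_1 = \{(\varphi, 0.3), (\theta, 0.6), (\neg\varphi, 0.5), (\lambda, 0.5)\}$,
$T_2 = \{(\varphi, 0.1), (\theta \vee \lambda, 0.7)\}$ and $\oplus(\alpha, \beta) = \min(1, \alpha + \beta)$.
According to (4), the merging result of $T_1$ and $T_2$ is


$$T_\oplus = \{(\varphi, 0.4), (\neg\varphi, 0.5), (\theta, 0.6), (\lambda, 0.5), (\varphi \vee \lambda, 0.6), (\varphi \vee \theta, 0.7), (\theta \vee \lambda, 1), (\varphi \vee \theta \vee \lambda, 1), (\neg\varphi \vee \theta \vee \lambda, 1)\}$$


If we merge $T_1$ and $T_2$ by using (5) with the conjunctive operator
$\otimes(\alpha, \beta) = \max(0, \alpha + \beta - 1)$, the merging result is


$$T_1 \otimes T_2 = \{(\varphi, 0.3), (\neg\varphi, 0.5), (\theta, 0.6), (\lambda, 0.5), (\theta \vee \lambda, 0.7), (\neg\varphi \vee \theta \vee \lambda, 0.2)\}$$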


In Example 1, $\varphi$ occurs in $T_\oplus$ with the weight 0.4, whereas it occurs in
$T_1 \otimes T_2$ with three weights 0.3, 0.1 and 0. The formulas $(\varphi, 0.1)$ and
$(\varphi, 0)$ are redundant. Similarly, $\theta \vee \lambda$ occurs in $T_\oplus$ with the weights
1 and 0.7, and it occurs in $T_1 \otimes T_2$ with three weights 0.7, 0.3 and 0.2. So we only
keep the formula with the largest weight in the merging result.
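The first merging result of Example 1 can be reproduced with a short script. This is a sketch under stated assumptions: formulas are restricted to clauses, encoded as frozensets of literal strings with `~` for negation (an encoding chosen here, not taken from the paper), tautologies are dropped, and only the largest weight is kept for each clause.

```python
def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal

def is_tautology(clause_set):
    return any(negate(lit) in clause_set for lit in clause_set)

def clause(*literals):
    return frozenset(literals)

def merge_eq4(t1, t2, op):
    """Syntactic merging of two bases {clause: weight} following Eq. (4)."""
    merged = {}

    def keep(c, weight):
        # Skip tautologies and keep only the largest weight per clause.
        if not is_tautology(c) and weight > merged.get(c, 0.0):
            merged[c] = weight

    for c1, a in t1.items():
        if c1 not in t2:
            keep(c1, op(a, 0.0))
    for c2, b in t2.items():
        if c2 not in t1:
            keep(c2, op(0.0, b))
    for c1, a in t1.items():
        for c2, b in t2.items():
            keep(c1 | c2, op(a, b))  # disjunction of two clauses
    return merged

def bounded_sum(a, b):
    return min(1.0, a + b)

T1 = {clause("phi"): 0.3, clause("theta"): 0.6,
      clause("~phi"): 0.5, clause("lam"): 0.5}
T2 = {clause("phi"): 0.1, clause("theta", "lam"): 0.7}

result = merge_eq4(T1, T2, bounded_sum)
for c, w in sorted(result.items(), key=lambda item: sorted(item[0])):
    print(" v ".join(sorted(c)), round(w, 2))
```

The nine weighted clauses obtained this way agree with the set $T_\oplus$ listed in Example 1.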

In [23], several belief merging operators are proposed to merge two possibilistic
belief bases by using (4). It is pointed out that maximum is suitable for conflicting belief
bases, and minimum is meaningful when the belief bases are consistent. When maximum
is chosen, the merging result is


$$T_1 \oplus_{\max} T_2 = \{(\varphi_i \vee \theta_j, \min(\alpha_i, \beta_j)) : (\varphi_i, \alpha_i) \in T_1 \text{ and } (\theta_j, \beta_j) \in T_2\}$$

Clearly, $T_1 \oplus_{\max} T_2$ is a weak merging result, i.e. too much belief is lost.
The above belief merging approaches use only one operator even if the possibilistic
belief bases are in conflict. The following example illustrates such an operator.


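Example 2. Let $T_1 = \{(\varphi, 0.5), (\theta, 0.6), (\neg\theta, 0.4), (\xi, 0.7)\}$ and
$T_2 = \{(\neg\varphi, 0.7), (\varphi, 0.3), (\theta, 0.7), (\xi, 0.5), (\lambda, 0.4)\}$. The two beliefs
$\theta$ and $\xi$ are supported by $T_1$ and $T_2$ with high degrees and they are not related to
the inconsistency of $T_1 \cup T_2$, thus it is necessary to increase their necessity degrees.
Suppose that the belief merging operator is the probabilistic sum defined by
$\oplus(\alpha, \beta) = \alpha + \beta - \alpha \times \beta$. According to (4), the merging result of
$T_1$ and $T_2$ is as follows:


$$T_\oplus = \{(\neg\varphi, 0.7), (\neg\theta, 0.4), (\lambda, 0.4), (\varphi, 0.65), (\varphi \vee \theta, 0.85), (\varphi \vee \xi, 0.75), (\varphi \vee \lambda, 0.7), (\neg\varphi \vee \theta, 0.88), (\varphi \vee \theta, 0.72), (\theta, 0.88), (\theta \vee \xi, 0.8), (\theta \vee \lambda, 0.76), (\neg\varphi \vee \neg\theta, 0.82), (\varphi \vee \neg\theta, 0.58), (\neg\theta \vee \xi, 0.7), (\neg\theta \vee \lambda, 0.64), (\neg\varphi \vee \xi, 0.91), (\varphi \vee \xi, 0.79), (\theta \vee \xi, 0.91), (\xi, 0.85), (\xi \vee \lambda, 0.82)\}$$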


In this example, the necessity degrees of $\theta$ and $\xi$ increase because the probabilistic
sum increases the weights. Because the formulas $\varphi$ and $\neg\varphi$ are in strong
conflict, the necessity degrees of both $\varphi$ and $\neg\varphi$ should be decreased. However,
in the merging result, the necessity degree of $\varphi$ increases to 0.65 and the necessity
degree of $\neg\varphi$ is still 0.7, which is irrational. This issue is caused by using a single
operator to merge two inconsistent formulas.

Consider $n$ possibilistic belief bases $\{T_1, T_2, \ldots, T_n\}$ from $n$ different sources. The
necessity degrees of the formulas related to the conflicts in $T_1 \cup T_2 \cup \ldots \cup T_n$
should be decreased. In contrast, the necessity degrees of the formulas that are supported
by these sources should be increased.

Now, we introduce a belief merging operator proposed in [9].


Definition 3. A belief merging operator $\oplus$ is strongly connective on $[0, 1]$ if,
for all $(\alpha_1, \ldots, \alpha_n) \in [0, 1]^n$,


$$\oplus(\alpha_1, \ldots, \alpha_n) \geq \max(\alpha_1, \ldots, \alpha_n)$$


A strongly connective operator is rational because it satisfies almost all the axioms
introduced in [9]. A strongly connective operator that satisfies
$\oplus(\alpha_1, \ldots, \alpha_n) = \max(\alpha_1, \ldots, \alpha_n)$ when $\forall i, \alpha_i \neq 1$, and
$\oplus(\alpha_1, \ldots, \alpha_n) = 1$ when $\exists i, \alpha_i = 1$, is called a monotonic
operator. A strongly connective operator is suitable for merging conflict-free formulas.


We propose a new merging operator as follows:


Definition 4. An operator $\oplus$ is an averagely increasing operator on $[0, 1]$ if,
for all $(\alpha_1, \ldots, \alpha_n) \in [0, 1]^n$,


$$\oplus(\alpha_1, \ldots, \alpha_n) \leq \max(\alpha_1, \ldots, \alpha_n)$$

This operator expresses that the necessity of a formula in the merging result does not
exceed the maximal necessity of this formula in the input possibilistic belief bases. An
example of an averagely increasing operator is the weighted average, defined by
$\oplus(\alpha, \beta) = x\alpha + y\beta$, such that $x, y \in [0, 1]$ and $x + y = 1$. If
$x = y = 1/2$, this operator is the standard average operator, and if $x > y$ (or $x < y$) it
means that the source associated with $x$ is more (resp. less) reliable than the other. An
averagely increasing operator is suitable for merging conflict-related formulas.
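The two operator families can be told apart with a quick grid check. The sketch below assumes the non-strict readings of Definitions 3 and 4 (an operator value at least, respectively at most, the maximum of its arguments) and uses a small tolerance for floating-point rounding; the helper names are illustrative.

```python
def probabilistic_sum(a, b):
    return a + b - a * b

def weighted_average(x):
    """Weighted average x*a + (1 - x)*b, with x the reliability of source 1."""
    y = 1.0 - x
    return lambda a, b: x * a + y * b

def _grid(steps):
    return [i / steps for i in range(steps + 1)]

def is_strongly_connective(op, steps=20, tol=1e-12):
    """op(a, b) >= max(a, b) everywhere on the grid."""
    return all(op(a, b) >= max(a, b) - tol
               for a in _grid(steps) for b in _grid(steps))

def is_averagely_increasing(op, steps=20, tol=1e-12):
    """op(a, b) <= max(a, b) everywhere on the grid."""
    return all(op(a, b) <= max(a, b) + tol
               for a in _grid(steps) for b in _grid(steps))

average = weighted_average(0.5)
print(is_strongly_connective(probabilistic_sum), is_averagely_increasing(average))
print(is_averagely_increasing(probabilistic_sum), is_strongly_connective(average))
```

The probabilistic sum reinforces weights and so falls in the first family, while any weighted average stays below the larger input and falls in the second.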


Definition 5. [8] Let $E$ be a subset of a classical belief base $T$. $E$ is a minimal
inconsistent set if and only if it meets the following conditions:


• $E \vdash false$, and
• $\forall \varphi \in E$, $E - \{\varphi\} \nvdash false$.


Definition 6. A formula $\varphi$ is in conflict in a classical belief base $T$ if and only if it
belongs to a minimal inconsistent set of $T$. The set of formulas in conflict in $T$ is
denoted by $Conflict(T)$.

Here, we propose a new method for belief merging based on multiple operators. We
suppose that, for a possibilistic belief base $T$, if a formula $\varphi$ is not in $T^*$, then
$(\varphi, 0)$ is added to $T$.


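Definition 7. Let $T_1 = \{(\varphi_i, \alpha_i) : i = 1, \ldots, n\}$ and
$T_2 = \{(\theta_j, \beta_j) : j = 1, \ldots, m\}$ be two possibilistic belief bases. Given a strongly
connective operator $\oplus_{st}$ and an averagely increasing operator $\oplus_{ua}$, the merging
result of $T_1$ and $T_2$ is defined as $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2) = A \cup B$, such that


$$A = \{(\varphi, \oplus_{ua}(\alpha, \beta)) \mid \varphi \in (Conflict(T_1 \cup T_2))^*, (\varphi, \alpha) \in T_1 \text{ and } (\varphi, \beta) \in T_2\}$$
$$B = \{(\varphi, \oplus_{st}(\alpha, \beta)) \mid \varphi \notin (Conflict(T_1 \cup T_2))^*, (\varphi, \alpha) \in T_1 \text{ and } (\varphi, \beta) \in T_2\}$$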


In Definition 7, we use two operators, a strongly connective operator and an averagely
increasing operator, to merge possibilistic belief bases. For the conflict-free formulas
in $T_1 \cup T_2$ we choose the strongly connective operator, and for the conflict-related
formulas in $T_1 \cup T_2$ we use the averagely increasing operator.

Another important point in Definition 7 is that $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2)$ not only
merges the formulas in $T_1$ and $T_2$ but also considers the formulas deduced from $T_1$ and
$T_2$. In this paper we also study implicit beliefs, i.e. beliefs that are inferred from the
input belief bases.


Example 3 (Example 2 continued). Suppose that we use the merging operators
$\oplus_{st}(\alpha, \beta) = \alpha + \beta - \alpha\beta$ and $\oplus_{ua}(\alpha, \beta) = (\alpha + \beta)/2$.
By Definition 7, the merging result of $T_1$ and $T_2$ is


$$\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2) = \{(\varphi, 0.4), (\neg\varphi, 0.35), (\theta, 0.65), (\neg\theta, 0.2), (\xi, 0.6), (\lambda, 0.4)\}$$


In Example 3, after the merging process, the necessity degrees of both $\varphi$ and
$\neg\varphi$ are decreased, and the necessity degree of $\varphi$ is larger than the necessity
degree of $\neg\varphi$. The necessity degrees of the other formulas are similar to those in
Example 2. However, the disjunctive formulas of Example 2 do not appear in Example 3.
Although we can infer these formulas from $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2)$ with smaller
necessity degrees than in Example 2, $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2)$ is simpler than
$T_\oplus$ in Example 2.

In Definition 7, all conflict-related formulas are weakened to obtain smaller necessity
degrees in the merging result. However, in some particular cases, it is more rational
to increase the necessity degrees of some of these formulas. For example, suppose that
we have two possibilistic belief bases $T_1 = \{(\varphi, 0.7), (\theta, 0.7)\}$ and
$T_2 = \{(\neg\varphi, 0.4), (\varphi, 0.7), (\theta, 0.4), (\xi, 0.5), (\lambda, 0.4)\}$ from two sources.
Clearly, $\varphi$ is supported by $T_1$. Although $\varphi$ is a conflict-related formula in $T_2$,
the necessity degree of $\varphi$ is larger than that of $\neg\varphi$, thus $\varphi$ can be considered
as supported by $T_2$. Therefore, both sources support $\varphi$, and the necessity degree of
$\varphi$ should be increased.


Definition 8. Let $T$ be an inconsistent belief base. A formula $\varphi$ that is
conflict-related in $T$ is weakly supported by $T$ if and only if $\exists (\varphi, \alpha) \in T$ such
that $\alpha > \beta$ for all $(\neg\varphi, \beta) \in T$.


Definition 9. Let $T_1$ and $T_2$ be two possibilistic belief bases. A formula $\varphi$ is a weakly
conflict-related formula of $T_1$ and $T_2$ if and only if $\varphi$ is weakly supported by $T_1$ and
$T_2$, respectively. The set of weakly conflict-related formulas in $T_1 \cup T_2$ is denoted by
$Weak(T_1 \cup T_2)$.

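Here, we define the merging method based on multiple operators as follows:


Definition 10. Let $T_1 = \{(\varphi_i, \alpha_i) : i = 1, \ldots, n\}$ and
$T_2 = \{(\theta_j, \beta_j) : j = 1, \ldots, m\}$ be two possibilistic belief bases. Given a strongly
connective operator $\oplus_{st}$ and an averagely increasing operator $\oplus_{ua}$, the merging
result of $T_1$ and $T_2$ is defined as $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2) = A \cup B$, such that


$$A = \{(\varphi, \oplus_{ua}(\alpha, \beta)) \mid \varphi \in (Conflict(T_1 \cup T_2) \setminus Weak(T_1 \cup T_2))^*, (\varphi, \alpha) \in T_1 \text{ and } (\varphi, \beta) \in T_2\}$$
$$B = \{(\varphi, \oplus_{st}(\alpha, \beta)) \mid \varphi \notin (Conflict(T_1 \cup T_2))^* \text{ or } \varphi \in (Weak(T_1 \cup T_2))^*, (\varphi, \alpha) \in T_1 \text{ and } (\varphi, \beta) \in T_2\}$$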


In $\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2)$, the necessity degrees of the formulas that are
conflict-related and not weakly supported by both sources are decreased. In contrast, the
necessity degrees of the formulas that are conflict-free or weakly supported by both
sources are increased. Obviously, if $\Delta_{\oplus_{st}, \oplus_{ua}}$ is conjunctive, the belief
merging method can easily be generalized to $n$ sources.


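Example 4. Given $T_1 = \{(\varphi, 0.5), (\theta, 0.6)\}$ and
$T_2 = \{(\neg\varphi, 0.4), (\varphi, 0.3), (\theta, 0.5), (\xi, 0.5), (\lambda, 0.4)\}$. Suppose that the
belief merging operators are $\oplus_{st}(\alpha, \beta) = \alpha + \beta - \alpha\beta$ and
$\oplus_{ua}(\alpha, \beta) = (\alpha + \beta)/2$. Because $\varphi$ is weakly conflict-related in
$T_1 \cup T_2$, by Definition 10 the merging result is
$\Delta_{\oplus_{st}, \oplus_{ua}}(T_1, T_2) = \{(\varphi, 0.65), (\neg\varphi, 0.2), (\theta, 0.8), (\xi, 0.5), (\lambda, 0.4)\}$.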
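The numbers of Example 4 can be reproduced with a small sketch of Definition 10. Two simplifying assumptions are made here and are not part of the paper: the bases contain literals only, so a literal is conflict-related exactly when its negation also occurs in $T_1 \cup T_2$, and the set of weakly conflict-related formulas is supplied by hand, as stated in the example. Literals are strings with `~` for negation.

```python
def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal

def merge_def10(t1, t2, op_st, op_ua, weak=frozenset()):
    """Merge two literal-only bases {literal: weight} as in Definition 10.

    weak: literals treated as weakly conflict-related (given as input here).
    With weak empty, this reduces to the scheme of Definition 7.
    """
    literals = set(t1) | set(t2)
    # Literal-only assumption: minimal inconsistent sets are pairs {x, ~x}.
    conflict = {lit for lit in literals if negate(lit) in literals}
    merged = {}
    for lit in literals:
        a = t1.get(lit, 0.0)  # a missing formula is added with weight 0
        b = t2.get(lit, 0.0)
        weaken = lit in conflict and lit not in weak
        merged[lit] = op_ua(a, b) if weaken else op_st(a, b)
    return merged

def op_st(a, b):   # probabilistic sum (strongly connective)
    return a + b - a * b

def op_ua(a, b):   # standard average (averagely increasing)
    return (a + b) / 2

T1 = {"phi": 0.5, "theta": 0.6}
T2 = {"~phi": 0.4, "phi": 0.3, "theta": 0.5, "xi": 0.5, "lam": 0.4}

result = merge_def10(T1, T2, op_st, op_ua, weak={"phi"})
for lit in sorted(result):
    print(lit, round(result[lit], 2))
```

The formula `phi` is reinforced because it is declared weakly conflict-related, `~phi` is averaged down, and the conflict-free formulas are combined with the probabilistic sum.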

Given a possibilistic belief profile $E$ and two operators, a strongly connective operator
and an averagely increasing operator, the merging result of our method is a possibilistic
belief base $T_{E, \oplus_{st}, \oplus_{ua}}$. For the conflict-free formulas in $T_1 \cup T_2$ we choose
the strongly connective operator, and for the conflict-related formulas in $T_1 \cup T_2$ we
use the averagely increasing operator.
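The belief merging algorithm for possibilistic belief bases based on combining multiple
operators is as follows:


Input: A possibilistic belief profile $E = \{T_1, \ldots, T_n\}$; a strongly connective
operator $\oplus_{st}$ and an averagely increasing operator $\oplus_{ua}$; the belief
merging operator $\Delta_{\oplus_{st}, \oplus_{ua}}(T_i, T_{i+1})$;

Output: A possibilistic belief base $T_{E, \oplus_{st}, \oplus_{ua}}$.


Begin
1. $i = 1$;
2. $T_{E, \oplus_{st}, \oplus_{ua}} \leftarrow \{(\varphi, \alpha) : \varphi \in T_1\} \cup \{(\theta, \beta) : \theta \in T_2\}$; $\alpha, \beta \in [0, 1]$
3. while $i < n$ do
4.    $A_{\oplus_{st}} = \{(\varphi, \oplus_{st}(\alpha, \beta)) \mid \varphi \notin (Conflict(T_i \cup T_{i+1}))^* \text{ or } \varphi \in (Weak(T_i \cup T_{i+1}))^*, (\varphi, \alpha) \in T_i \text{ and } (\varphi, \beta) \in T_{i+1}\}$
      $B_{\oplus_{ua}} = \{(\varphi, \oplus_{ua}(\alpha, \beta)) \mid \varphi \in (Conflict(T_i \cup T_{i+1}) \setminus Weak(T_i \cup T_{i+1}))^*, (\varphi, \alpha) \in T_i \text{ and } (\varphi, \beta) \in T_{i+1}\}$
      $\Delta_{\oplus_{st}, \oplus_{ua}}(T_i, T_{i+1}) = A_{\oplus_{st}} \cup B_{\oplus_{ua}}$;
5.    $T_{E, \oplus_{st}, \oplus_{ua}} \leftarrow \Delta_{\oplus_{st}, \oplus_{ua}}(T_i, T_{i+1})$;
6.    If $Cn(T_i) = \{\varphi_i \in \mathcal{L} : T_i \vdash \varphi_i\}$ or $Cn(T_{i+1}) = \{\varphi_{i+1} \in \mathcal{L} : T_{i+1} \vdash \varphi_{i+1}\}$ then
         $T_{E, \oplus_{st}, \oplus_{ua}} \leftarrow T_{E, \oplus_{st}, \oplus_{ua}} \cup \{(\varphi, \alpha) : \varphi \in E \setminus (T_i \cup T_{i+1})\} \cup Cn(T_i) \cup Cn(T_{i+1})$
7.    Else $T_{E, \oplus_{st}, \oplus_{ua}} \leftarrow T_{E, \oplus_{st}, \oplus_{ua}} \cup \{(\varphi, \alpha) : \varphi \in E \setminus (T_i \cup T_{i+1})\}$;
8.    $i = i + 1$;
9. end-while
10. return $T_{E, \oplus_{st}, \oplus_{ua}}$
End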


4 Logical Properties 


In this section, we examine several logical properties of the merging result obtained by
our method.


Proposition 1. Let $T = \{T_1, \ldots, T_n\}$ be a set of $n$ possibilistic belief bases which are
jointly consistent. Then the merging result of $T$ satisfies Eq. (3).


Proof
If $\bigcup T_i$ is consistent, then for any formula $(\varphi, \alpha_i)$, $T_i \vdash_\pi (\varphi, \alpha_i)$
if and only if there exists an inference for $(\varphi, \alpha)$ in $T_\oplus$. Therefore, the merging
result of $T$ satisfies Eq. (3).
Proposition 2. Let $T = \{T_1, \ldots, T_n\}$ be a set of $n$ possibilistic belief bases. If
$T_\oplus$ is the merging result of $T$ which satisfies Eq. (4) and
$(T_1 \otimes T_2 \ldots \otimes T_n)$ is the merging result of $T$ which satisfies Eq. (5), then
$T_\oplus = (T_1 \otimes T_2 \ldots \otimes T_n)$ and $(T_1 \otimes T_2 \ldots \otimes T_n)^* \subseteq (T_\oplus)^*$.


Proof
It is easy to see that $T_\oplus = (T_1 \otimes T_2 \ldots \otimes T_n)$ and
$(T_1 \otimes T_2 \ldots \otimes T_n)^* \subseteq (T_\oplus)^*$. In order to show that the reverse
direction is not true, we consider the following counterexample. Given
$T_1 = \{(\varphi, 0.7), (\neg\varphi, 0.5), (\lambda, 0.5)\}$ and $T_2 = \{(\theta, 0.6)\}$. Because
$Inc(T_1) = 0.5$, $\lambda$ is omitted in $(T_1 \otimes T_2)$. However, we have
$(\lambda, \oplus(0.5, 0)) \in T_\oplus$.

Proposition 2 shows that the merging results may omit some beliefs. In the
counterexample, the formula $\lambda$ is conflict-free with respect to the other formulas, so it
should not be omitted from the merging result. In practice, $\lambda$ may be an important
belief, and it may be recovered from $T_\oplus$.

The following proposition compares $T_\oplus$ and $(T_1 \otimes T_2 \ldots \otimes T_n)$.


Proposition 3. Let $T = \{T_1, T_2, \ldots, T_n\}$ be a set of possibilistic belief bases. If
$T_\oplus$ is a merging result which satisfies Eq. (4) and $(T_1 \otimes T_2 \ldots \otimes T_n)$ is a
merging result of $T$ which satisfies Eq. (5), where $\otimes$ is a dual operator of $\oplus$, then
$T_\oplus = (T_1 \otimes T_2 \ldots \otimes T_n)$ and $T_\oplus \subseteq (T_1 \otimes T_2 \ldots \otimes T_n)$.

The proof of Proposition 3 is obvious, so we omit it.


Proposition 4. Let $T = \{T_1, T_2, \ldots, T_n\}$ be a set of possibilistic belief bases. If
$T_\oplus$ is a merging result which satisfies Eq. (4), then it satisfies (IC1), (IC2), (IC4), (IC5),
(IC6), (IC7) and (IC8), and it does not satisfy (IC0) and (IC3).


5 Conclusion 


In this paper, we propose a model to merge prioritized belief bases in possibilistic logic.
The main contributions are as follows. Firstly, we introduced an algorithm to obtain a
standard possibility distribution from a belief profile by using different operators to
achieve better merging results. We use two families of operators, strongly connective
operators and averagely increasing operators, to merge possibilistic belief bases. Syntactic
counterparts of semantic merging methods are also considered. Secondly, we pointed out
that our method can avoid the drowning effect and keep implicit beliefs, which is its
main advantage over other methods. Lastly, we investigated and analysed the logical
properties of the merging results obtained by our method. For future work, we will
propose a set of axioms to characterise the merging result and discuss the complexity
of our algorithm.
Abstract. Mitigating the chatter phenomenon effectively is an important
issue for high-quality machining and production. In this paper, an
innovative solution to actively control chatter vibration in the milling
process is presented. The mathematical model of the milling process is
presented first. Then a discrete time sliding mode control (DSMC)
technique is blended with a Type-2 fuzzy logic system. The developed
active controller achieves superior mitigation of chatter. The Lyapunov
scheme is used to validate the stability of the proposed controller. The
nonlinearities embedded in the cutting forces and the damper friction are
compensated effectively by the Type-2 fuzzy technique. The vibration
attenuation ability of DSMC-Type-2 fuzzy (DSMC-T2) is compared with
that of the discrete time PID (D-PID) and DSMC-Type-1 fuzzy (DSMC-T1)
controllers to validate the effectiveness of the controller. Finally,
numerical analysis is carried out to validate that DSMC-T2 is superior to
D-PID and DSMC-T1 in minimizing milling chatter.


Keywords: Milling chatter · Type-2 fuzzy · Sliding mode control


1 Introduction 


High-quality production is hampered by self-generated vibration in the machining
process. This type of vibration affects the quality of the final product and deserves
close attention because of its impact on the manufacturing industries [1]. The
self-generated vibration is termed chatter, an important phenomenon that degrades
the machining process and results in dimensional inaccuracy and a reduced material
removal rate (MRR). The chatter phenomenon also results in a low-quality finish
with significant tool wear [2]. An active control system comprises effective
controllers, efficient sensors and dampers which, when implemented, improve the
performance of machining with less vibration in the machine tool [3]. Harms et
al. [4] investigated an innovative active damper mechanism for chatter mitigation
and justified it experimentally. Chen et al. [5] developed an adaptive algorithm
based on Fourier series for active control of chatter. Weremczuk et al. [6] proposed
an algorithm to control milling chatter using an active approach and a harmonic
excitation methodology. Alharbi et al. [7] implemented a PID controller for chatter
mitigation in the milling process. The cutting forces associated with the machining
process involve nonlinearities [8]. Moradi et al. [9] demonstrated that the cutting
forces are a combination of quadratic and cubic polynomial terms, which are
nonlinear in nature. In situations where the model of the milling procedure is
unknown, fuzzy logic is handy and effective. Fuzzy logic has earned a strong
research reputation due to its capacity for nonlinear mapping while maintaining
robustness and simplicity. Liang et al. [10] proposed an innovative solution to
control chatter in the end milling process by using a fuzzy logic system. A Type-2
fuzzy logic system performs significantly better than a Type-1 fuzzy logic system
because it possesses an additional degree of freedom (DOF), known as the footprint
of uncertainty [11]. In the work of Paul et al. [12], the Type-2 fuzzy technique is
applied to handle unknown uncertainties and is combined with PD/PID control for
vibration mitigation. Sliding Mode Control (SMC) is a superior control mechanism
for vibration attenuation in the milling tool, since SMC exhibits the same movement
pattern as the tool vibration. SMC is well suited to nonlinear systems due to its
specific design criteria [13]. Paul et al. [14] presented the combination of Discrete
Time Sliding Mode Control (DSMC) with a Type-1 fuzzy system for the control of
structural vibration. Moradi et al. [15] investigated the chatter phenomenon in the
turning process and proposed SMC for chatter attenuation. Ma et al. [16] developed
an active sliding mode control strategy to mitigate chatter in the turning process,
using a dynamic output feedback sliding surface combined with an adaptive law for
noise approximation. SMC has various stability conditions depending on the
continuous systems. An important condition is [17]:

$$|s(k + 1)| < |s(k)| \quad (1)$$


where $s(k) = 0$ defines the sliding surface. Another important condition is
given by Bartoszewicz et al. [18]:


$$|s(k)| < g \quad (2)$$


where $g$ is the quasi-sliding mode band width. Discrete Time Sliding Mode
Control (DSMC) is an efficient controller for vibration attenuation because it
explicitly accounts for the sampling period, which is an important aspect in
vibration control. This work is carried out by implementing the third option,
"active control of chatter". First, the mathematical model of the milling process
is derived along the x and y components. Then the nonlinearities are identified
for efficient compensation. The behaviour of the Active Vibration Damper (AVD)
for chatter suppression in the milling process is simulated using Matlab/Simulink.
The modeling takes the dynamics of the AVD into consideration. Discrete Time
Sliding Mode Control (DSMC) generates the control signals used for the
suppression of chatter. DSMC is combined with Type-2 fuzzy logic (DSMC-T2
fuzzy) to obtain an efficient strategy. The Type-2 fuzzy system ensures that the
nonlinearities are compensated effectively. Lyapunov stability analysis is used to
prove that the DSMC-T2 fuzzy controller is stable. Chatter attenuation in the
milling process is achieved by the combined action of DSMC-T2 fuzzy and the AVD.
The concept and methodology are validated using numerical analysis. The results
of DSMC-T2 fuzzy are compared with those of Discrete Time Sliding Mode Control
with Type-1 fuzzy (DSMC-T1 fuzzy) and Discrete time PID (D-PID) to identify the
most suitable controller.


2 Modeling of Milling Process with Active Control 


For a milling tool with $n$ evenly spaced teeth that is flexible relative to the
rigid workpiece, a generalized 2-DoF mathematical model is [19]:


$$M_m \ddot{x}_c(t) + C_m \dot{x}_c(t) + K_m x_c(t) = F_m(t) \quad (3)$$


where $M_m$, $C_m$ and $K_m$ are the equivalent mass, damping and stiffness
matrices, respectively:


$$M_m = \begin{bmatrix} m_{mx} & 0 \\ 0 & m_{my} \end{bmatrix}, \quad C_m = \begin{bmatrix} c_{mx} & 0 \\ 0 & c_{my} \end{bmatrix}, \quad K_m = \begin{bmatrix} k_{mx} & 0 \\ 0 & k_{my} \end{bmatrix} \in R^{2 \times 2} \quad (4)$$


Here, $x_c(t) = [x\ y]^T$ is the displacement of the tool along the x and y
components, and $F_m(t) = [F_x\ F_y]^T$ contains the x and y components of the
cutting forces. Fig. 1 illustrates the dynamics of the milling process [20]. The
closed-form equations representing the nonlinear cutting forces along the x and y
components are given by [9]:


Fig. 1. Illustration of milling process dynamics. 
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Poa gw JG Ae + mAy® + GAr? + mAy® + GAx ee 
7 an +3, Ax? Ay + 38y2ArAy? + 2y3ArAy + y4 (5) 
pak es + nf Ay? + (3 Aa? + 3 Ay? + C3 Ax + ie) 
y an +3y} Ar? Ay + 374 ArAy? + 273 ArAy + 73 


where $\Delta x + x(t - T) = x(t)$ and $\Delta y + y(t - T) = y(t)$. The time delay is
$T = \frac{2\pi}{n\Omega}$, where $\Omega$ is the spindle speed (rad/s).


2.1 Active Vibration Damper (AVD) for Active Control Mechanism 


As seen from the Fig. 2, AVD is placed on the top of the spindle and is utilized 
to minimize the tool chattering generated by the external force. The position of 
the AVD is at the centre of mass (CM). Also, it makes an inclination y with 
CM. Placing the damper efficiently is cost effective, since it removes the
need for two dampers. Combining the modeling equation with the control
force $u_c$ yields:


$$M_m \ddot{x}_c(t) + C_m \dot{x}_c(t) + K_m x_c(t) = F_m(t) + u_c - d_e \quad (6)$$




Fig. 2. AVD on Spindle Top. 


where $u_c = [u_{cx} \; u_{cy}]^T \in \mathbb{R}^{2 \times 1}$ is the control
signal applied to the damper along both axes and
$d_c = [d_{cx} \; d_{cy}]^T \in \mathbb{R}^{2 \times 1}$ is the combined
damping-friction effect resolved along the two axes. It is important to account
for the damper friction, which can be resolved as:


\[
\begin{aligned}
d_{cx} &= \Lambda \dot{x}_r + \Gamma m_d g \tanh[\Upsilon \dot{x}_r] \\
d_{cy} &= \Lambda \dot{y}_r + \Gamma m_d g \tanh[\Upsilon \dot{y}_r]
\end{aligned} \tag{7}
\]

where $\Lambda$, $\Upsilon$ and $\Gamma$ are the damping coefficients associated
with the Coulomb friction [21], $m_d$ is the mass of the damper, and
$\dot{x}_r$ and $\dot{y}_r$ are the relative velocities of the damper along the
x and y components, respectively. Combining the control action with the
closed-loop system gives:


\[
\begin{aligned}
m_{mx}\ddot{x} + c_{mx}\dot{x} + k_{mx}x &= F_x + u_{cx} - \Lambda \dot{x}_r - \Gamma m_d g \tanh[\Upsilon \dot{x}_r] \\
m_{my}\ddot{y} + c_{my}\dot{y} + k_{my}y &= F_y + u_{cy} - \Lambda \dot{y}_r - \Gamma m_d g \tanh[\Upsilon \dot{y}_r]
\end{aligned} \tag{8}
\]


In Eq. (8), the nonlinear terms are
$\Lambda \dot{x}_r + \Gamma m_d g \tanh[\Upsilon \dot{x}_r]$,
$\Lambda \dot{y}_r + \Gamma m_d g \tanh[\Upsilon \dot{y}_r]$, $F_x$ and $F_y$.
An intelligent technique is incorporated to deal with these nonlinearities.
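
The damper friction model can be sketched numerically as follows. This is a minimal Python illustration, assuming the per-axis form d = Lambda*v + Gamma*m_d*g*tanh(Upsilon*v) with g taken as gravitational acceleration; the function name, the argument names and the numeric values in the usage line are hypothetical and are not taken from the paper.

```python
import math


def damper_friction(rel_velocity, lam, gamma, upsilon, damper_mass, g=9.81):
    """Combined viscous and smoothed Coulomb friction along one axis.

    Implements d = lam * v + gamma * m_d * g * tanh(upsilon * v), where
    tanh replaces the discontinuous sign(v) of ideal Coulomb friction.
    """
    viscous = lam * rel_velocity
    coulomb = gamma * damper_mass * g * math.tanh(upsilon * rel_velocity)
    return viscous + coulomb


# Hypothetical parameter values, chosen only to exercise the function.
d_cx = damper_friction(rel_velocity=0.02, lam=0.5, gamma=0.3,
                       upsilon=50.0, damper_mass=1.2)
```

The tanh term saturates for large relative velocities, so the friction force behaves like Coulomb friction away from zero velocity while remaining smooth near it, which is convenient for simulation and controller design.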


3 Discrete Time Sliding Mode Control with Type-2 
Fuzzy Compensation 


The continuous-time closed-loop model of the milling process from Eq. (8) can be
written as:


\[ M_m \ddot{x}_c(t) + C_m \dot{x}_c(t) + f_{kn}(x) = F_m(t) + u_c(t) - d_c \tag{9} \]


The stiffness term $f_{kn}$ is treated as nonlinear. The milling process model
must be discretized to make it suitable for the design of computer-based
control. To this end, define the states $v_1(t) = x_c$ and
$v_2(t) = \dot{x}_c$. The model (9) is then written in state-space form as:

\[ \dot{v}(t) = A_p v + B_p u_c + F_{kn} + f_n \tag{10} \]

where

\[
v(t) = \begin{bmatrix} v_1(t) \\ v_2(t) \end{bmatrix}, \quad
A_p = \begin{bmatrix} 0 & I \\ 0 & -M_m^{-1} C_m \end{bmatrix}, \quad
B_p = \begin{bmatrix} 0 \\ M_m^{-1} \end{bmatrix}, \quad
F_{kn} = -M_m^{-1} f_{kn}, \quad
f_n = M_m^{-1}[F_m - d_c].
\]

Here $v(k)$ is the state vector, $A_{dis} = e^{A_p T}$ is the state matrix with
$T$ the sampling period, $B_{dis} = \left( \int_0^T e^{A_p s} \, ds \right) B_p$
is the input vector, $u_c(k)$ is the scalar input, $F_{kn}(z(k))$ is the model
uncertainty matrix and $f_{dn}(k)$ is the nonlinearity involved in the cutting
forces and frictional forces. Using (10), the discretized model is [22]:


\[ v(k+1) = A_{dis} z(k) + B_{dis} u_c(k) + F_{kn}(z(k)) + f_{dn}(k) \tag{11} \]


From (11), the discrete time model is:


\[ v(k+1) = A_{dis} v(k) + B_{dis} u_c(k) + F_{kn}(z(k)) + f_{dn}(k) \tag{12} \]
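
The zero-order-hold discretization used above, with a state matrix of the form exp(A_p T) and an input matrix given by the integral of exp(A_p s) over one sampling period times B_p, can be sketched in plain Python with truncated power series. This is an illustrative sketch only: the helper names, the number of series terms and the double-integrator example are assumptions, and a practical implementation would normally rely on a numerical library.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_axpy(acc, m, scale):
    """Return acc + scale * m, element-wise."""
    return [[acc[i][j] + scale * m[i][j] for j in range(len(m[0]))]
            for i in range(len(m))]


def identity(n):
    """Return the n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def zoh_discretize(a_p, b_p, t_s, terms=30):
    """Return (A_dis, B_dis) for sampling period t_s.

    A_dis = sum_k A^k T^k / k!            (matrix exponential)
    B_dis = (sum_k A^k T^(k+1) / (k+1)!) B  (integral of the exponential)
    """
    n = len(a_p)
    a_dis = identity(n)                      # k = 0 term: I
    integral = mat_axpy([[0.0] * n for _ in range(n)], identity(n), t_s)  # I * T
    power = identity(n)                      # holds A^k
    factorial = 1.0                          # holds k!
    for k in range(1, terms):
        power = mat_mul(power, a_p)
        factorial *= k
        a_dis = mat_axpy(a_dis, power, t_s ** k / factorial)
        integral = mat_axpy(integral, power,
                            t_s ** (k + 1) / (factorial * (k + 1)))
    return a_dis, mat_mul(integral, b_p)


# Example: unit-mass double integrator (position, velocity), T = 0.1 s.
A_dis, B_dis = zoh_discretize([[0.0, 1.0], [0.0, 0.0]], [[0.0], [1.0]], 0.1)
```

For the double integrator the series terminates after the linear term, so the result can be checked by hand: the state matrix has rows (1, T) and (0, 1), and the input matrix has entries T^2/2 and T.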


For precision, and to introduce a fuzzy system that compensates the
nonlinearities, the model is rewritten as:


\[ v(k+1) = j[z(k)] + h[z(k)] u_c(k) \tag{13} \]

where $j[z(k)] = A_{dis} v(k) + F_{kn}(z(k)) + f_{dn}(k)$ and
$h[z(k)] = B_{dis}$. $A_{dis}$ and $B_{dis}$ are unknown and
$F_{kn}(z(k)) + f_{dn}(k)$ is nonlinear. The terms $j[z(k)]$ and $h[z(k)]$ are
therefore modeled using the Type-2 fuzzy logic technique. The Type-2 fuzzy
logic system with jth output can be expressed as:


i } 1 
fruzy = PEI AD = = [(6Fj(2)ura( lh) + 6h wry ((h)] (14) 
Diet fijyig+Dn Sfij yur = Diet figyrg +d Segre 


WHS Uy Sel gg Se 

. fi . fy ‘é . * 

ij = SOFAS Fae hh’ and Gry = Se Again, f/ and f, are the 
firing strengths associated with yj; and y,; of ¢-th rule. So the compensation 
technique for j [z(k)] and h[z(k)] are applied as follows: 


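\[
\begin{aligned}
\hat{j} &= \tfrac{1}{2} w_{rj}(k) \phi_{rj}[z(k)] + \tfrac{1}{2} w_{lj}(k) \phi_{lj}[z(k)] \\
\hat{h} &= \tfrac{1}{2} w_{rg}(k) \phi_{rg}[z(k)] + \tfrac{1}{2} w_{lg}(k) \phi_{lg}[z(k)]
\end{aligned} \tag{15}
\]

$\Omega_1(k)$ and $\Omega_2(k)$ satisfy the following: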
2 (k 
ee | Ty HE Nl em(k +1) [I> I emt | 
cng) Sem DN Ment) be 
the ND if lem +1) > lem) | 
0 if [lem(k+1) I< & ll em(A) I 


where $0 < \Omega_1(k) < 1$ and $0 < \Omega_2(k) < 1$. The dead-zone parameters
are denoted by $\beta_1$ and $\beta_2$. Again,


x(k) =|] 62 -[e(A)] I? + 1] OE Le(R)}me() |? 
o(K) =I of [eC] I? +l of [z(@)}uc(6) I? (7) 


Now the modeling error $e_m(k)$ is represented as:


\[ e_m(k) = \hat{v}(k) - v(k) \tag{18} \]

where $\hat{v}(k)$ is the state of the fuzzy model, therefore:

\[ (\beta_1 + \beta_2) \hat{v}(k+1) = \hat{j}[z(k)] + \hat{h}[z(k)] u_c(k) \tag{19} \]


where $\beta_1$ and $\beta_2$ are positive design constants with
$\beta_1, \beta_2 > 1$. For active vibration control, the desired state is taken
as $v_d(k) = 0$. The control error is then:

\[ e_c(k) = v_d(k) - v(k) = -v(k) \tag{20} \]

The SMC can be written as:

\[
u_c(k) = \frac{2 \left[ K^T \mathbf{e}_c(k) - \tfrac{1}{2} \left( w_{rj}(k) \phi_{rj}[z(k)] + w_{lj}(k) \phi_{lj}[z(k)] \right) - \sigma \, \mathrm{sign}[s(k)] \right]}{w_{rg}(k) \phi_{rg}[z(k)] + w_{lg}(k) \phi_{lg}[z(k)]} \tag{21}
\]

where $\mathbf{e}_c(k) = [e_c(k+1-n) \cdots e_c(k)]^T$. The feedback gain vector
is $K = [k_n \cdots k_1]^T \in \mathbb{R}^n$. The sliding mode gain and the
switching function are denoted by $\sigma$ and $s(k)$, respectively, where the
switching function is:


s(k) = KE [oul —1) + pep eclh) (22) 
Theorem 1. If the fuzzy model (19) is used to compensate the nonlinear system
(13) with the update laws


\[
\begin{aligned}
w_{rj}(k+1) - w_{rj}(k) &= -\Omega_1(k) e_m(k) \phi_{rj}[z(k)] \\
w_{lj}(k+1) - w_{lj}(k) &= -\Omega_2(k) e_m(k) \phi_{lj}[z(k)] \\
w_{rg}(k+1) - w_{rg}(k) &= -\Omega_1(k) u_c(k) e_m(k) \phi_{rg}[z(k)] \\
w_{lg}(k+1) - w_{lg}(k) &= -\Omega_2(k) u_c(k) e_m(k) \phi_{lg}[z(k)]
\end{aligned} \tag{23}
\]


then the closed-loop system is uniformly stable and bounded, provided that the
identification error $e_m(k)$ is within the range

Qw(k)E 


Il Em (k ) |P> © (k) + Ga(k) 


(24) 
and control error satisfies 


(25) 


\le.(k Nl < < o? || Z|| (1+ , mae) 


with the gain $\sigma$ of the discrete-time sliding mode controller (21) satisfying


2 (G1 + B2) (26) 

> Tal | 
The above theorem establishes that both the identification error and the control
error are bounded, which implies that the control system is stable. The proof
of Theorem 1 will be presented in the expanded version of this paper. The
Lyapunov candidate of the form


E(k) = EM Grp (R) J? +4 Ul ay (B) [P+ I Deg (&) IP 
Uj] dig(k) ||? + red (k)Ze.(k) (27) 


is utilized to prove the stability of the controller. 


4 Numerical Analysis 


First, the cutting conditions of the milling process given in [23] are used for
the tool vibration simulation, so as to validate the effectiveness of the
developed control mechanism. To assess the performance of DSMC-T2 fuzzy, its
results are compared with those of DSMC, DSMC-T1 fuzzy and discrete-time PID
(D-PID). Matlab/Simulink is used as the software environment. The simulations
are intended to show that tool chatter can be mitigated significantly by an AVD
combined with a control system based on DSMC-T2 fuzzy.
With Type-2 fuzzy logic, six fuzzy rules for j and four fuzzy rules for h are
sufficient for effective chatter control. With Type-1 fuzzy logic, nine fuzzy
rules for j and six fuzzy rules for h are sufficient. The membership functions
are Gaussian. IF-THEN rules are applied for both types of fuzzy system. The
chosen learning rates are $\Omega_1 = \Omega_2 = 0.9$. Theorem 1 is used to
select $\sigma$, which is 0.17. The tool vibration attenuation along the
x-direction is presented in Figs. 3, 4 and 5. The results show that the DSMC-T2
fuzzy controller performed better than all the other controllers used in this
research.


[Plot legend: without control system; with control system]

Fig. 3. Tool vibration along x-direction using D-PID.

Fig. 4. Tool vibration along x-direction using DSMC-T1 fuzzy.

Fig. 5. Tool vibration along x-direction using DSMC-T2 fuzzy.


5 Conclusion 


In this paper, a novel technique for milling chatter mitigation is demonstrated
using an active control strategy. Using Lyapunov analysis, a theorem is
established to prove that the system states under the DSMC-T2 fuzzy controller
are bounded. A Type-2 fuzzy system is implemented to handle the nonlinearities.
The results of the numerical analysis indicate that DSMC-T2 fuzzy is the
best-performing controller among those compared. In this research, the setup is
made cost-effective by installing a single AVD in an inclined position to
control the forces along the x and y axes.


Abstract. MapReduce has become the dominant programming model
for analyzing and processing large-scale data. However, the model has its
own limitations. It does not fully support iterative computation,
caching, or operations with multiple inputs. In addition, its I/O
and communication costs are high. One of the most complex operations
used extensively, and at great cost, in MapReduce is the recursive join.
It requires exactly the processing characteristics that a MapReduce
environment lacks. Therefore, this research proposes efficient solutions
for processing recursive joins in Spark, a next-generation data
processing engine that succeeds MapReduce. Our proposal eliminates a
large amount of redundant data generated in repeated join steps and
takes advantage of in-memory computing and the cache mechanism. Through
experiments, we show that our solutions significantly improve the
execution performance of recursive joins on large-scale datasets.


Keywords: Big data processing · Recursive join · MapReduce · Spark


1 Introduction 


With the growth of information technology, the amount of information on the
Internet is increasing rapidly. The term “Big Data” has therefore been widely
used to describe large and complex datasets. This has posed many challenges for
researchers in a variety of fields, such as search engines, social network
analysis, and web-data analysis. New distributed programming models running on
computer clusters are needed to process the huge amounts of data in these
applications. The idea of distributed computing is to divide a problem into
smaller problems and execute each of them on a node of a cluster.

MapReduce [1] was introduced by Google in 2004. Its motivation is to process
huge amounts of input data in a reasonable time. It has since become a
popular standard model for processing large datasets on parallel and distributed
systems. A computer cluster running MapReduce tasks can include thousands
of computing nodes with high fault tolerance, which makes it appropriate for
processing huge amounts of data in a parallel and distributed fashion. MapReduce
is compatible with many algorithms, e.g., web-scale document analysis [1],
relational query evaluation [2], and large-scale image processing [3]. However,
it is not designed for operations with multiple inputs. Nor does it directly
support iterative computations for applications such as PageRank [4], HITS
(Hypertext Induced Topic Search) [5], recursive relational queries [6], data
clustering [7], neural network analysis [8], social network analysis [9], and
Internet traffic analysis [10]. These applications involve repeated
computations on large-scale datasets until a fix-point is reached. Programmers
must set up the iterative algorithm in a MapReduce environment and control the
iterative tasks themselves. As a result, data must be read and written many
times, which increases the I/O, CPU, and communication costs. Together, these
problems pose big challenges for large-scale data processing in a MapReduce
environment.

Join [11,12] is an operation that combines two or more tuples in a database.
It is commonly used in typical data queries and is expensive and complex. There
are several kinds of join: two-way join, multi-way join [13], chain join [14],
and recursive join [15-17]. Join queries on tuples become more complicated in
the Big Data world. In this research, we focus on recursive joins, a complex
computation that is expensive to perform but is still needed in many fields,
e.g., PageRank, graph mining, network monitoring, social networks, and
bioinformatics.

A typical example of a recursive join is a query that discovers the
relationships of a person in a social network. It is defined as follows:


Friend(x, y) ← Know(x, y);
Friend(x, y) ← Friend(x, z) ⋈ Know(z, y);


A person x is a friend of a person y if x knows y. Person x is also a friend of
y if x is a friend of z and z knows y. This query finds all friends of friends
of a user.
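
The two rules above can be read as a fix-point computation. The following Python sketch evaluates them naively on an in-memory set of pairs; the function name and the sample names are hypothetical, and the sketch ignores distribution entirely.

```python
def friends_closure(know):
    """Evaluate Friend(x, y) by repeating the join until nothing new appears."""
    know = set(know)
    friend = set(know)                      # Friend(x, y) <- Know(x, y)
    while True:
        # Friend(x, y) <- Friend(x, z) join Know(z, y)
        derived = {(x, y)
                   for (x, z) in friend
                   for (z2, y) in know
                   if z == z2}
        new_tuples = derived - friend
        if not new_tuples:                  # fix-point reached
            return friend
        friend |= new_tuples


# Hypothetical sample data: an knows binh, binh knows chi.
result = friends_closure({("an", "binh"), ("binh", "chi")})
```

Each pass re-joins the whole Friend relation with Know, so tuples found in earlier passes are derived again. This repeated work is what the Semi-Naive algorithm discussed later avoids.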

Processing iterative join operations on large-scale datasets is heavyweight
because MapReduce is unaware of the nature of iterative computations, which
wastes much bandwidth, I/O, and CPU cycles. Consequently, we improve it with
several solutions. We first leverage the advantages of Spark RDDs for efficient
iterative computation. Then, we remove most non-joining data in repeated join
steps to minimize the amount of data transmitted over the network. In addition,
we cache the persistent dataset that is used in each iteration step.


2 Related Works


2.1 Recursive Join in MapReduce 


A recursive join computes the transitive closure of a relation, which involves
a number of repeated join operations until a fix-point is reached [17]. Many
algorithms have been introduced to evaluate recursively defined relations in
traditional databases, such as Naive [23], Semi-Naive [1-3], Smart [24,25],
Minimal evaluations [25], Warshall [26], and Warren [27]. However, these
algorithms are not always appropriate for MapReduce.

Several studies focus on recursion in cluster environments. Afrati et al.
[27,28] proposed a recursive evaluation on computer clusters to compute the
transitive closure for a recursive query. The approach uses two kinds of tasks:
Join tasks and Dup-elim tasks. However, because the tasks run recursively for a
long time, the failure rate increases. Moreover, it leaves open the question of
minimizing the data-volume cost, which is considerably higher than that of the
linear transitive closure.

HaLoop [29] is an extended version of Hadoop that was designed to support
iterative applications using MapReduce. It enhances the efficiency of these
programs with an inner-iteration cache mechanism and a loop-aware task scheduler
that improves data locality. Ideally, HaLoop can efficiently support processing
recursions on huge datasets. However, HaLoop remains at the research level and
is not practical to use. Currently, HaLoop is no longer developed or supported.

Recently, Shaw et al. [31] presented an optimized solution for recursive joins
using the Semi-Naive algorithm in a Hadoop MapReduce environment. It repeats
two types of MapReduce jobs: a join job and a job that computes the incremental
dataset. The Reducer Input Cache of HaLoop is used to reduce the costs of
related tasks. Nevertheless, this solution inherits the limitations of HaLoop.
The cost of reading and writing the cache becomes significant because all
incremental datasets of previous iterations need to be re-written, re-indexed,
and re-read to detect duplicates.

Therefore, to overcome the above drawbacks, we consider another framework that
gives better support for caching and iterative computation. On that framework,
we optimize the performance of recursive joins by eliminating non-joining data.


2.2 Apache Spark 


Apache Spark [18] is a framework written in Scala that allows processing huge
amounts of distributed data quickly and efficiently. It is compatible with many
storage systems such as HDFS (Hadoop Distributed File System), Cassandra [19],
HBase [20], and Amazon S3 [21]. A striking feature of Spark is that, compared
with Hadoop, it is reported to run up to 100 times faster in memory and 10 times
faster on disk.


Spark’s Iterative Processing Capacity. Spark is a powerful tool for
iterative processing on large-scale datasets using RDDs (Resilient Distributed
Datasets). An RDD is a fault-tolerant parallel data structure that can be cached
in memory and reused in later transformations [22]. RDD evaluation is lazy: a
series of transformations on an RDD is delayed until the result is actually
needed. This saves time and improves efficiency.
Iterative processing in Hadoop MapReduce is carried out as a sequence of jobs
in which the intermediate results must be written to HDFS and then read as input
to the following job. In contrast, Spark reads data from HDFS, performs the
repeated operations on RDDs, and finally writes to HDFS.


Spark’s Cache Mechanism. Because an RDD is recomputed for each action, reusing
an RDD many times is costly and wastes time. Therefore, Spark provides a
mechanism called persist, which allows RDDs to be reused efficiently. Caching is
an optimization technique for (iterative and interactive) Spark computations.
It saves interim partial results so that they can be reused in subsequent
stages. These interim results are kept as RDDs in memory (the default) or on
more durable storage such as disk, and can optionally be replicated.
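
The interplay of lazy evaluation and persistence can be illustrated without Spark. The toy class below is a pure-Python stand-in, not the Spark API: the class name, its methods and the evaluation counter are hypothetical and exist only to show why an unpersisted dataset is recomputed on every action while a persisted one is computed once.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a recipe and computes on demand."""

    def __init__(self, compute):
        self._compute = compute       # zero-argument function returning a list
        self._persisted = False
        self._cache = None
        self.evaluations = 0          # how often this dataset was materialized

    def map(self, fn):
        # Transformation: nothing runs yet, only the recipe is recorded.
        return LazyDataset(lambda: [fn(v) for v in self._materialize()])

    def persist(self):
        # Ask for the first materialized result to be kept for reuse.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cache is not None:
            return self._cache
        self.evaluations += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

    def collect(self):
        # Action: forces evaluation of the whole lineage.
        return self._materialize()


base = LazyDataset(lambda: [1, 2, 3])
squares = base.map(lambda v: v * v)       # lazy: base has not been evaluated
first = squares.collect()                 # evaluates base and squares
second = squares.collect()                # evaluates both again (no persist)
```

In an iterative job the dataset that stays constant across iterations plays the role of the persisted object, so its lineage is evaluated once rather than once per iteration.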


2.3 Intersection Bloom Filter


Bloom Filter. In 1970, Burton Bloom introduced the Bloom filter (BF) [39]. It
is a space-efficient probabilistic data structure used to test the membership
of an element in a set. Here BF(S) denotes a Bloom filter with m bits and k
independent hash functions that represents a set S of n elements.


Intersection Bloom Filter. The Intersection Bloom filter (IBF) was introduced
by our research group in [17,32]. It is a probabilistic data structure designed
to represent the intersection of two sets; it identifies the common elements of
the sets with some probability of false positives.
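
A minimal Python sketch of both structures follows. The Bloom filter uses m bits and k hash positions per item, as described above. The intersection filter shown here is built as the bitwise AND of two filters with identical parameters, which is one common construction and is an assumption on our part; the construction in [17,32] may differ. The class and function names, and the use of SHA-256 to derive hash positions, are also illustrative choices.

```python
import hashlib


class BloomFilter:
    """Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0                 # Python int used as a bit array

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def is_empty(self):
        return self.bits == 0


def intersection_filter(a, b):
    """Return a filter whose bits are set only where both inputs have bits set."""
    if (a.m, a.k) != (b.m, b.k):
        raise ValueError("filters must share m and k")
    out = BloomFilter(a.m, a.k)
    out.bits = a.bits & b.bits
    return out


bf_left = BloomFilter(1024, 3)
bf_right = BloomFilter(1024, 3)
for key in (1, 2, 3):
    bf_left.add(key)
for key in (2, 3, 4):
    bf_right.add(key)
ibf = intersection_filter(bf_left, bf_right)
```

Keys present in both sets always pass the intersection filter, because all of their bit positions are set in both inputs. Keys present in only one set are usually rejected but may pass as false positives.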


2.4 Intersection Bloom Join Algorithm 


IntersectionBloomJoin [32] is a join algorithm that improves BloomJoin by using
an IBF instead of a standard BF. The algorithm uses the IBF to filter out most
of the tuples that do not participate in the join. It dramatically reduces the
cost of related tasks and has been shown to be more efficient than other join
algorithms [17]. Therefore, in this research, IntersectionBloomJoin is used for
evaluating recursive joins.


3 Optimizing Recursive Joins 


One of the most popular recursive join algorithms suitable for implementation
in a MapReduce environment is Semi-Naive [17,30]. It also operates on datasets
organized in rows and columns, which is the setting we are interested in.
Hence, in this section, we present recursive join evaluation based on the
Semi-Naive algorithm in MapReduce and then propose solutions to improve the
evaluation in Spark.

For convenience of presentation, we introduce additional notation for this work
in Table 1.


Table 1. Notations


Notation   Description
F          The result dataset Friend(x, y)
ΔF         The incremental dataset
K          The grounding dataset Know(x, y)
O          The output of the join job of ΔF and K
BF         Bloom filter
IBF        Intersection Bloom filter


To start with, we recall the traditional Semi-Naive algorithm. Its basic idea
is simple: it avoids recomputing already generated tuples while returning the
same result as the Naive algorithm. As illustrated in Table 2, the algorithm
works as follows:


– begin with the result dataset F empty and the incremental dataset ΔF
initialized to the grounding dataset K;

– repeatedly perform two activities: add the incremental dataset ΔF to the
result F, and recompute ΔF by join and difference operations using the
previous ΔF;

– stop when the incremental dataset ΔF in the i-th iteration is empty.


Table 2. The Semi-Naive algorithm for evaluating recursive joins


Algorithm 1 - Semi-Naive evaluation for recursive joins
(1)  $F_0 = \emptyset$, $\Delta F_0 = K(x, y)$, $i = 0$;
(2)  Repeat
(3)      i++;
(4)      $F_{i-1} = (\Delta F_0 \cup \dots \cup \Delta F_{i-1})$;
(5)      $\Delta F_i = \pi_{xy}(\Delta F_{i-1} \bowtie_z K) - F_{i-1}$;
(6)  Until $\Delta F_i = \emptyset$
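
The Semi-Naive steps listed above can be sketched in a few lines of Python on in-memory sets. The function name, the dictionary index on the join attribute and the iteration counter are illustrative additions; the sketch shows the logic only and says nothing about how the jobs are distributed.

```python
def semi_naive_closure(know):
    """Semi-Naive evaluation: only the newest tuples are joined each round."""
    know = set(know)
    # Index K on the join attribute z so that each lookup is a dictionary access.
    successors = {}
    for z, y in know:
        successors.setdefault(z, set()).add(y)

    result = set()            # F, the accumulated result
    delta = set(know)         # incremental dataset, initialized to K
    iterations = 0
    while delta:
        result |= delta       # add the incremental dataset to the result
        joined = {(x, y)      # projection on (x, y) of delta joined with K on z
                  for (x, z) in delta
                  for y in successors.get(z, ())}
        delta = joined - result   # keep only tuples not seen before
        iterations += 1
    return result, iterations


closure, rounds = semi_naive_closure({(1, 2), (2, 3), (3, 4)})
```

On the chain 1 → 2 → 3 → 4 the loop runs three rounds: the first joins the original pairs, the second extends the two new pairs, and the third finds nothing further to add.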


Recursive Join Evaluation in MapReduce. As shown in Table 2, the Semi-Naive
algorithm evaluates a recursive join by iterating two MapReduce jobs: a join job
and a difference job that computes the incremental dataset. At step i, the join
job joins $\Delta F_{i-1}$ and K to generate the set $O_i$. The second job reads
$O_i$ and computes $\Delta F_i = O_i - F_{i-1}$ by eliminating the tuples in
$O_i$ that duplicate previous results. However, these jobs incur high costs for
disk I/O and communication. Besides, the join job emits much non-joining
intermediate data, which substantially increases the related costs.


Proposed Solutions to Improve the Recursive Join Evaluation. Line 5 of the
Semi-Naive algorithm, $\Delta F_i = \pi_{xy}(\Delta F_{i-1} \bowtie_z K) - F_{i-1}$,
computes the incremental dataset using a join and a difference operation. We
propose several solutions to optimize these operations in a Spark environment.
(1) Use the in-memory processing capability of Spark RDDs for the repeated
operations to reduce slow disk I/O. (2) Use Spark’s data caching mechanism for
the constant dataset K to speed up the operations that access K multiple times.
When the data does not fit in memory, Spark spills it to disk under the
MEMORY_AND_DISK storage level. (3) Eliminate non-joining data to reduce the
read/write and communication costs related to those useless data. We use an IBF
to remove the redundant data in the join job of ΔF and K. The join algorithm
used in this work is IntersectionBloomJoin [41], as mentioned above. Figure 1
presents a flowchart for processing the recursive join using the proposed
solutions in Spark.


Preprocessing. Before joining the two datasets K and ΔF, both are converted to
the Spark PairRDD type and null key/value pairs are eliminated. At the same
time, we construct an intersection Bloom filter IBF(K, ΔF).


Iterative Evaluation of the Recursive Join. The datasets K and ΔF are filtered
with IBF(K, ΔF) to remove the tuples that do not participate in the join. Then
we join the filtered datasets: after each join task we process the result to
create a new ΔF, and the following join task is performed between K and the new
ΔF. This join task is repeated until a fix-point is reached.

While generating new datasets, we partition them in order to split up the data.
The aim is to avoid memory overflow, reduce the data transmitted over the
network, and increase processing speed.

Stop conditions for the recursive join (fix-point):


– The number of iterations reaches the maximum limit.

– OR $\Delta F_i$ is empty, i.e., no new data is generated.

– OR the intersection of the two filters BF(K) and BF($\Delta F_i$) is empty,
i.e., the newly generated result has no data that participates in the join in
the next iteration.
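
The filtered iteration and its three stop conditions can be sketched as follows. To keep the example short and deterministic, exact key sets stand in for the Bloom filters BF(K), BF(ΔF) and their intersection; a real Bloom filter would additionally let some non-joining tuples through as false positives. The function name, the returned stop reason and the default iteration limit are hypothetical.

```python
def filtered_recursive_join(know, max_iterations=100):
    """Semi-Naive loop with key filtering and the three stop conditions."""
    know = set(know)
    know_keys = {z for z, _ in know}          # stand-in for BF(K)
    result = set()
    delta = set(know)
    iterations = 0
    reason = "max_iterations"                 # stop condition 1 (default)
    while iterations < max_iterations:
        result |= delta
        delta_keys = {z for _, z in delta}    # stand-in for BF(delta)
        common = know_keys & delta_keys       # stand-in for IBF(K, delta)
        if not common:                        # stop condition 3
            reason = "empty_intersection"
            break
        # Drop tuples that cannot take part in the join before joining.
        know_f = [(z, y) for z, y in know if z in common]
        delta_f = [(x, z) for x, z in delta if z in common]
        index = {}
        for z, y in know_f:
            index.setdefault(z, []).append(y)
        joined = {(x, y) for x, z in delta_f for y in index[z]}
        delta = joined - result
        iterations += 1
        if not delta:                         # stop condition 2
            reason = "empty_delta"
            break
    result |= delta                           # keep tuples found in the last round
    return result, iterations, reason


closure, rounds, stop_reason = filtered_recursive_join({(1, 2), (2, 3), (3, 4)})
```

On this chain the loop stops through the empty intersection: after two joins the only new tuple ends in 4, which is not a join key of K, so a third join is skipped altogether.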


4 Experiment and Evaluation 


4.1 Cluster and Datasets


We conduct the experiments on a computer cluster with 14 nodes (1 master and
13 slaves) at the Mobile Network and BigData Laboratory of the Information and
Communication Technology Faculty, Can Tho University. Each computer has 4 Intel
Core i5 3.2 GHz CPUs, 4 GB of RAM, a 500 GB HDD, and the 64-bit Ubuntu 14.04 LTS
operating system. The software versions used are Java 1.8, Hadoop 2.7.1, and
Spark 1.6.

The standard datasets, generated by the Purdue MapReduce Benchmarks Suite, have
sizes of 5 GB, 10 GB, 20 GB, and 30 GB. The data is stored as text files in
which each row has at most 39 comma-separated fields and each field contains 19
characters.


[Flowchart nodes: Read from HDFS; Create pairRDD (pairRDD_DeltaF, pairRDD_K);
persist(MEMORY_AND_DISK); Cache F, Cache K; addKey to BF_DeltaF and BF_K;
Filter(K); Filter(DeltaF); Join(K, DeltaF); Dedup(DeltaF); Update RDD F and
pairRDD_DeltaF; loop test on IBF.isEmpty and i < maxIteration; Save to HDFS]

Fig. 1. Flowchart for optimized recursive join algorithm in Spark.


4.2 Evaluation Method 


Our work compares three approaches to implementing a recursive join: the
Semi-Naive algorithm in Hadoop MapReduce, the Semi-Naive algorithm in Spark, and
the optimized Semi-Naive algorithm in Spark. On each experimental dataset, we
evaluate the approaches by the amount of intermediate data transmitted over the
network and by execution time.


4.3 Evaluation of the Approaches


Here we compare the Semi-Naive evaluation in Hadoop MapReduce, the evaluation in
Spark, and the optimized evaluation using cache and filters in Spark with
respect to the amount of intermediate data and execution time. We can thereby
assess the level of improvement brought by the proposed approach.

Figure 2 shows the amount of intermediate data transmitted over the network by
the three approaches. Figure 3 presents the execution time of the approaches.


[Chart; series: Semi-Naive - Hadoop, Semi-Naive - Spark, Semi-Naive + Cache +
Filter - Spark; x-axis: 5GB.Test1, 10GB.Test2, 20GB.Test3, 30GB.Test3; y-axis:
50,000,000 to 400,000,000]

Fig. 2. Intermediate data (records)


Figure 2 clearly illustrates the improvement of Spark compared with Hadoop.
Iterative processing and partitioning in memory greatly reduce the amount of
intermediate data transmitted over the network. The Bloom filter and the cache
mechanism further reduce redundant non-joining data, which optimizes the
recursive join in Spark.

[Chart; series: Semi-Naive - Hadoop, Semi-Naive - Spark, Semi-Naive + Cache +
Filter - Spark; x-axis: 5GB.Test1, 10GB.Test2, 20GB.Test3, 30GB.Test3; top
y-axis tick: 9,000]

Fig. 3. Execution time (seconds)


Figure 3 shows that Spark improves the speed of data processing through more
effective use of memory and cache when compared with Hadoop. The proposed
optimization improves the execution time of the Semi-Naive algorithm. For the
optimized Semi-Naive algorithm, the larger the dataset, the greater the benefit;
on small datasets, the cost of processing the filters is comparatively high.


5 Conclusion and Future Work 


5.1 Conclusion 


A recursive join is an operation that is costly in time and resources. This
research has provided efficient and simple solutions for evaluating recursive
joins on large-scale datasets in Spark. The main results of this work include:


1. An investigation of available solutions for recursive joins on large-scale
datasets in a MapReduce environment. It points out the limitations of the
previous studies and the need for a new approach.

2. Optimization of recursive joins in Spark. The solutions proposed to compute
recursive joins efficiently are: (a) an in-memory iterative processing
mechanism with Spark RDDs to enhance execution performance; (b) Spark’s caching
mechanism to cache the constant dataset across iterations and reduce the cost
of repeatedly reading and writing data; (c) intersection filters and the
Intersection Bloom join algorithm to eliminate non-joining data, which
significantly reduces the read/write and communication costs related to those
useless data.

3. Experiments and evaluations of the recursive join in Hadoop MapReduce and
Spark.

Through experiments, we demonstrate that the proposed solutions are
significantly more efficient than currently available solutions for recursive
joins on large-scale datasets. By performing the recursive join in Spark, we
fully exploit Spark's capabilities, such as distributed and parallel
processing, iterative processing, caching, and fast in-memory computing.
Moreover, using Bloom filters to remove redundant elements of the input datasets
alleviates the burden of reading, writing, and transferring too much data over
the network.

Optimizing recursive joins in Spark benefits a variety of fields, e.g.,
large-scale databases, social networks, bioinformatics, sensor networks,
network monitoring, and machine learning. Finally, this research is an
important step toward optimizing the management of large-scale databases on
cloud infrastructure.


5.2 Future Work 


This work substantially reduces the cost of reading and writing data and speeds
up execution. However, optimizing recursive joins on large-scale datasets in
Spark still has many limitations. To achieve the expected efficiency, we need
computer clusters that are stable and powerful enough to process the data after
each iteration. Furthermore, skewed data is a big challenge for this research in
particular and for large-scale data processing in general. We plan to address
it in future work.


Abstract. When producing a numerical control (NC) program for a 5-axis CNC
machine (the so-called milling robot) to mill a sculptured surface, a constant
feed rate value is usually assigned based on the programmer’s experience. For
this reason, the feed rate in most NC programs is not optimized; it is much
lower than the maximum reachable value. To increase the productivity of the
machining process, the feed rate in NC programs for milling robots needs to be
maximized. This paper proposes a new feed rate optimization model in which the
objective function and all the kinematic constraints are transformed and
expressed explicitly in a parametric domain that is commonly used in the tool
path generation process performed by current CAM systems. Thus, the optimal feed
rate values along a parametric tool path can be computed in an effective and
simplified manner. Numerical examples demonstrate the effectiveness of the
proposed method.


Keywords: High speed milling · 5-axis milling robot · Kinematic modelling ·
Feed rate interpolation


1 Introduction 


In recent years, high-speed 5-axis CNC machines (milling robots) have been
widely used in manufacturing industries to machine complex parts composed of
sculptured surfaces, such as molds, dies and impellers. A 5-axis CNC machine is
similar to two cooperative robots, one robot carrying the tool and one robot
carrying the workpiece. The structure diagram of a typical 5-axis CNC machine is
shown in Fig. 1, and the two kinematic chains of the machine are illustrated in
Fig. 2.

While milling a part, the two robots always cooperate and remain in contact
along the machining tool path. Therefore, the machines are also called milling
robots. In programming practice, CAD/CAM systems are usually used to generate
tool paths as parameterized curves (CL files) in order to machine a part. The CL
files are then postprocessed to produce G-code files that control the machines.
In the postprocessing step, a conservative constant feed rate is usually
selected for sections of the generated tool path, or sometimes a single feed
rate for the whole tool path. For this reason, the axis drive limits are never
reached, which may lead to lengthy machining times, since the programmed feed
rate is never reached by the machine tool. Therefore, optimizing the feed rate
profile along the tool path to minimize the machining time without violating the
limits of the drives of the machine is an important need.

Fig. 2. The kinematic chain diagram of a typical 5-axis CNC machine

Feed rate optimization for high-speed 5-axis CNC machining has been a known
issue for years. However, in most previous investigations, the feed rate
optimization models were formulated either in terms of the arc length parameter
of the tool path [2-6] or in an implicit and complex form [7-13]. Note that the
arc length re-parameterization of the tool path increases the computational
complexity of solving the feed rate optimization problem. In [7-9], the
optimization model was formulated and expressed in a parametric domain. However,
it is difficult to determine analytically the high-order derivatives of the
parameter with respect to the arc length in such optimization models. Although
the objective function and the drive constraints were derived with respect to
the parametric domain in [10-12], the actual feed rate does not appear in the
formulations; it is calculated implicitly via the derivative of the parameter.
Little attention has been paid to formulating the feed rate optimization problem
by using explicit inverse kinematics equations and their derivatives expressed
in the parametric domain. For this reason, a new formulation of the feed rate
optimization model is developed in this study. The main advantage of the
proposed formulation is that all constraints are transformed and explicitly
expressed in terms of a common parameter, with the use of the Jacobian matrix
and its time derivatives. Thus, optimal values of the feed rate along a
parametric tool path, which is usually generated by CAM systems, can be
calculated in an effective and simplified manner compared with previous methods.


2 Differential Inverse Kinematics of 5-Axis CNC Machines 
in a Parametric Domain 


In this section, based on the forward kinematic equation of a general 5-axis CNC machine
derived in our previous investigations [1, 14, 15], the differential inverse
kinematic equations are transformed into a parametric domain. These equations are
important when formulating and solving the feed rate optimization problem for the
machines.

Let $X = [x\ y\ z\ \theta\ \varphi]^T$ denote the position and orientation of the tool, in which $(x, y, z)$ are the
tool tip coordinates and $(\theta, \varphi)$ are the orientation angles of the tool axis in the workpiece
coordinate system $O_w X_w Y_w Z_w$. Let $q = [q_1\ q_2\ q_3\ q_4\ q_5]^T$ be the vector of the five joint
variables of a machine. The forward kinematic equation is as follows


$$X = f(q) \tag{1}$$


For example, the forward kinematic equations for the 5-axis CNC machine Spinner
U5-620 (Fig. 3) are expressed as follows


Fig. 3. The machine Spinner U5-620


$$
\begin{aligned}
x &= X\cos B\cos C - Y\sin C + Z\sin B\cos C - d\sin B\cos C;\\
y &= X\cos B\sin C + Y\cos C + Z\sin B\sin C - d\sin B\sin C;\\
z &= -X\sin B + Z\cos B - d\cos B + d;\\
\theta &= B;\\
\varphi &= C;
\end{aligned}
\tag{2}
$$
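The forward kinematic map of Eq. (2) can be evaluated directly. The following is a minimal sketch, not the authors' implementation: the function name `forward_kinematics` is hypothetical, the angles are assumed to be in radians, and the y-component is written with sin C in its Z and d terms, which is what a rotation about the C axis requires and is an assumption about the intended form of Eq. (2).

```python
import math

def forward_kinematics(q, d):
    """Map joint values q = (X, Y, Z, B, C) to the tool pose (x, y, z, theta, phi).

    Linear axes and d share one length unit; B and C are in radians.
    The grouping (Z - d) is only a factored form of the Z and d terms of Eq. (2).
    """
    X, Y, Z, B, C = q
    sB, cB = math.sin(B), math.cos(B)
    sC, cC = math.sin(C), math.cos(C)
    x = X * cB * cC - Y * sC + (Z - d) * sB * cC
    y = X * cB * sC + Y * cC + (Z - d) * sB * sC   # sin C assumed here
    z = -X * sB + (Z - d) * cB + d
    return (x, y, z, B, C)
```

With B = C = 0 the map reduces to the identity on (X, Y, Z), which is a quick sanity check on any value of d.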


Note that the joint variables of the machine are denoted as $q = [X\ Y\ Z\ B\ C]^T$;
$d$ is the distance from the table center point to the center line of the rotary axis B
when $B = C = 0^\circ$. Given a parametric tool path $X(u) = [x(u)\ y(u)\ z(u)\ \theta(u)\ \varphi(u)]^T$,
$u \in [0, 1]$, and a feed rate profile $f(t) = \dot{s}(t)$, where $s$ is the arc length of the curve
$X(s) = [x(s)\ y(s)\ z(s)]^T$, the inverse kinematic equation at position level can be
written as follows


$$q(u) = f^{-1}(X(u)) \tag{3}$$


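To determine the inverse kinematic equations at velocity, acceleration and jerk level,
the following mathematical transformations are introduced.
Taking the time derivative of both sides of Eq. (1) yields


$$\dot{X} = J\dot{q} \tag{4}$$


where $J = \partial f / \partial q$ is the Jacobian matrix. Thus


$$\dot{q} = J^{-1}\dot{X} = J^{-1}\frac{dX}{ds}\frac{ds}{dt} = J^{-1}X'_s\,\dot{s} \tag{5}$$

Since

$$ds = \left|\frac{dX}{du}\right| du \tag{6}$$

$$X'_s = \frac{dX}{ds} = \frac{dX/du}{ds/du} = \frac{dX/du}{|dX/du|} = \frac{X'_u}{|X'_u|} \tag{7}$$

substituting Eqs. (6) and (7) into Eq. (5), with $\dot{s} = f$, yields

$$\dot{q} = J^{-1}\frac{X'_u}{|X'_u|}\,f \tag{8}$$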


Equation (8) is the inverse kinematic equation at velocity level. In other words, at any
value of the parameter $u$, the joint (axis) velocities of a 5-axis machine can be calculated
from the given feed rate $f$, the geometrical characteristics of the desired tool path
$X'_u$, and the kinematic structure of the machine $J$. Taking the time derivative
of Eq. (4) once more yields

$$\ddot{X} = J\ddot{q} + \dot{J}\dot{q} \tag{9}$$

Thus

$$\ddot{q} = J^{-1}\left(\ddot{X} - \dot{J}\dot{q}\right) \tag{10}$$

Since $\dot{X} = X'_s\dot{s}$,

$$\ddot{X} = \frac{dX'_s}{dt}\dot{s} + X'_s\ddot{s} = X''_s\dot{s}^2 + X'_s\ddot{s} \tag{11}$$

Rewriting Eq. (6) in the following form

$$s'_u = |X'_u| \tag{12}$$

and substituting it into Eq. (7) yields

$$X'_s = \frac{X'_u}{s'_u} \tag{13}$$

Taking the derivative of Eq. (13) with respect to $s$ yields

$$X''_s = \frac{(dX'_u/ds)\,s'_u - X'_u\,(ds'_u/ds)}{(s'_u)^2} \tag{14}$$

Note that

$$\frac{dX'_u}{ds} = \frac{X''_u}{s'_u}, \qquad \frac{ds'_u}{ds} = \frac{s''_u}{s'_u} = \frac{(X'_u)^T X''_u}{|X'_u|^2} \tag{15}$$

Thus

$$X''_s = \frac{X''_u}{|X'_u|^2} - \frac{X'_u\,(X'_u)^T X''_u}{|X'_u|^4} \tag{16}$$

Substituting Eqs. (13) and (16) into Eq. (11) and the result into Eq. (10), with $\dot{s} = f$ and $\ddot{s} = \dot{f}$, yields

$$\ddot{q} = J^{-1}\left[\left(\frac{X''_u}{|X'_u|^2} - \frac{X'_u\,(X'_u)^T X''_u}{|X'_u|^4}\right) f^2 + \frac{X'_u}{|X'_u|}\,\dot{f} - \dot{J}\dot{q}\right] \tag{17}$$

It is interesting that, in a parametric domain, the axis accelerations of a machine can
be totally determined with Eq. (17), which depends on the square of the feed rate $f^2$ and the
feed acceleration $\dot{f}$, the geometrical characteristics of the tool path $X'_u$ and $X''_u$, and the
kinematic structure of the machine $J$ and $\dot{J}$. For the jerk calculation, the third order time
derivative of Eq. (1) is derived as follows

$$\dddot{X} = J\dddot{q} + 2\dot{J}\ddot{q} + \ddot{J}\dot{q} \tag{18}$$

Thus

$$\dddot{q} = J^{-1}\left(\dddot{X} - 2\dot{J}\ddot{q} - \ddot{J}\dot{q}\right) \tag{19}$$
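Note that

$$\dddot{X} = \frac{d\ddot{X}}{dt} = \frac{d}{dt}\left(X''_s\dot{s}^2 + X'_s\ddot{s}\right) = X'''_s\dot{s}^3 + 3X''_s\dot{s}\ddot{s} + X'_s\dddot{s} \tag{20}$$

Substituting Eq. (20) into Eq. (19), with $\dot{s} = f$, $\ddot{s} = \dot{f}$ and $\dddot{s} = \ddot{f}$, yields

$$\dddot{q} = J^{-1}\left[X'''_s f^3 + 3X''_s f\dot{f} + X'_s\ddot{f} - 2\dot{J}\ddot{q} - \ddot{J}\dot{q}\right] \tag{21}$$

in which $X'_s$, $X''_s$ and $X'''_s = (dX''_s/du)/|X'_u|$ are expressed through $X'_u$, $X''_u$ and $X'''_u$
by means of Eqs. (13) and (16), and $\dot{q}$ and $\ddot{q}$ follow from Eqs. (8) and (17).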


Equation (21) implies that the jerk of the machine axes is calculated in a parametric
domain as well. Note that in Eqs. (17) and (21), the terms $\dot{J}$ and $\ddot{J}$ can be determined
element by element as follows

$$\dot{J} = \left[\dot{J}_{mn}\right]_{5\times5}, \qquad \dot{J}_{mn} = \sum_{i=1}^{5}\frac{\partial J_{mn}}{\partial q_i}\,\dot{q}_i \tag{22}$$

and

$$\ddot{J} = \left[\ddot{J}_{mn}\right]_{5\times5}, \qquad \ddot{J}_{mn} = \sum_{i=1}^{5}\left(\sum_{k=1}^{5}\frac{\partial^2 J_{mn}}{\partial q_i\,\partial q_k}\,\dot{q}_i\dot{q}_k + \frac{\partial J_{mn}}{\partial q_i}\,\ddot{q}_i\right) \tag{23}$$


3 Optimization of the Feed Rate 


In this section, the feed rate optimization model for 5-axis machining is formulated
with the objective of maximizing the feed rate while respecting the machine tool
constraints. The machine tool constraints are the velocity, acceleration and jerk limits
of all active linear and rotary feed drives. In other words, the feed rate optimization
problem can be defined as the maximization of the feed rate values computed along
the entire curved tool path, while respecting a set of physical limits of the machine. The
formulation of the problem is expressed as follows


$$\max_{u \in [0,1]} f(u) \tag{24}$$

subject to

$$|\dot{q}(u)| \le V_{max} \tag{25}$$
$$|\ddot{q}(u)| \le A_{max} \tag{26}$$
$$|\dddot{q}(u)| \le J_{max} \tag{27}$$


where $V_{max}$, $A_{max}$, and $J_{max}$ are vectors of the velocity, acceleration and jerk limits
of the axes of a machine, respectively.


$$
\begin{aligned}
V_{max} &= [v_{Xmax}\ v_{Ymax}\ v_{Zmax}\ v_{Bmax}\ v_{Cmax}]^T\\
A_{max} &= [a_{Xmax}\ a_{Ymax}\ a_{Zmax}\ a_{Bmax}\ a_{Cmax}]^T\\
J_{max} &= [j_{Xmax}\ j_{Ymax}\ j_{Zmax}\ j_{Bmax}\ j_{Cmax}]^T
\end{aligned}
\tag{28}
$$


The constraints Eqs. (25-27) are calculated with Eqs. (8), (17) and (21), respectively.
Suppose that the parameter domain $u \in [0, 1]$ is subdivided into $n$ intervals of equal length
by the points $u_1, u_2, \ldots, u_n$. Then, at each value $u_i$ $(i = 1, \ldots, n)$, the optimal value of the
feed rate $f_i$ can be obtained by solving Eqs. (24-27). For $n$ steps, all optimal values
$f_i$ $(i = 1, \ldots, n)$ are obtained in the same manner. For example, we consider again the
machine Spinner U5-620 shown in Fig. 3. The desired tool path to be cut with the machine
is represented in the workpiece coordinate system $O_w X_w Y_w Z_w$ as a Bezier curve as follows


$$X(u) = P_0(1-u)^3 + 3P_1(1-u)^2u + 3P_2(1-u)u^2 + P_3u^3 \tag{29}$$


where $P_0 = [0\ \ 0.05\ \ 0.05]^T$, $P_1 = [0.10\ \ 0.05\ \ 0.15]^T$,
$P_2 = [0.20\ \ 0.05\ \ 0.05]^T$, and $P_3 = [0.30\ \ 0.05\ \ 0.15]^T$. The feed rate
optimization model is solved for the given tool path, and the resulting optimal feed rate
values along the path are shown in Fig. 4. Note that, for an individual machine, a maximum
allowable feed rate $f_{max}$ is usually given. The optimal value of the feed rate calculated
with Eqs. (25-27) must satisfy $f_i \le f_{max}$. For the machine Spinner U5-620, $f_{max} = 0.1$ m/s.
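The tool path of Eq. (29) and its first parametric derivative, which is the quantity $X'_u$ used throughout Sect. 2, can be sampled at the points $u_i$ with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the third control point is read as [0.20 0.05 0.05], and only the three position coordinates given by the control points are handled.

```python
def bezier_point(P, u):
    """Cubic Bezier X(u) for control points P = [P0, P1, P2, P3], Eq. (29)."""
    b = ((1 - u) ** 3, 3 * (1 - u) ** 2 * u, 3 * (1 - u) * u ** 2, u ** 3)
    return [sum(b[k] * P[k][i] for k in range(4)) for i in range(len(P[0]))]

def bezier_derivative(P, u):
    """First derivative dX/du of the cubic Bezier curve."""
    b = (3 * (1 - u) ** 2, 6 * (1 - u) * u, 3 * u ** 2)
    return [sum(b[k] * (P[k + 1][i] - P[k][i]) for k in range(3))
            for i in range(len(P[0]))]

def sample_parameters(n):
    """End points u_i = i / n of n equal sub-intervals of [0, 1]."""
    return [i / n for i in range(1, n + 1)]

# Control points of the example tool path (P2 read as 0.20, an assumption).
CONTROL_POINTS = [
    [0.00, 0.05, 0.05],
    [0.10, 0.05, 0.15],
    [0.20, 0.05, 0.05],
    [0.30, 0.05, 0.15],
]
```

For instance, `bezier_derivative(CONTROL_POINTS, u)` evaluated over `sample_parameters(n)` gives the tangent vectors that enter Eq. (8) at each sample.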


For the machine, the limits of velocities, accelerations and jerks are as follows


$$
\begin{aligned}
V_{max} &= [0.25\ \ 0.25\ \ 0.25\ \ 2.5\ \ 4.0]\\
A_{max} &= [2.0\ \ 2.0\ \ 2.0\ \ 10.0\ \ 10.0]\\
J_{max} &= [3.0\ \ 3.0\ \ 3.0\ \ 4.36\ \ 3.49]
\end{aligned}
\tag{30}
$$
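Because Eq. (8) makes every axis velocity proportional to the feed rate, the velocity constraint (25) alone yields a closed-form upper bound on $f$ at each sample point. The sketch below covers only that part of the problem: the acceleration and jerk constraints (26) and (27), which also involve the time derivatives of $f$, are not handled, and the vector `c` (standing for $J^{-1}X'_u/|X'_u|$ at one value of $u$) is an input that is assumed to be computed elsewhere. The function name and the sample values of `c` used below are hypothetical.

```python
V_MAX = [0.25, 0.25, 0.25, 2.5, 4.0]   # axis velocity limits, Eq. (30)
F_MAX = 0.1                             # maximum allowable feed rate, m/s

def velocity_limited_feed(c, v_max=V_MAX, f_max=F_MAX):
    """Largest f such that |c_i| * f <= v_max_i on every axis and f <= f_max.

    c is the vector multiplying f in Eq. (8) at one value of u.
    """
    bound = f_max
    for ci, vi in zip(c, v_max):
        if ci != 0.0:               # an axis that does not move imposes no bound
            bound = min(bound, vi / abs(ci))
    return bound
```

For example, `velocity_limited_feed([5.0, 0.0, 0.0, 50.0, 0.0])` is limited by the first and fourth axes, while a path segment that hardly moves any axis simply returns `F_MAX`. A full solution of Eqs. (24-27) would tighten this bound further.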


It is clearly seen that, compared with the commonly used constant feed rate $f = 0.02$ m/s,
the feed rate values calculated in this study are significantly higher at every point along
the tool path on the machining surface. Thus, when this optimal feed rate profile is used for
postprocessing the NC program, the machining time is significantly reduced, and
the productivity is thereby increased when machining large, complex parts on
5-axis CNC machines.

Fig. 4. The optimal feed rates


4 Conclusions 


A new feed rate optimization model was formulated in a parametric domain. The main
idea of the proposed method is to map the differential kinematic equations
of 5-axis CNC machines into a parametric domain. Since, in practice, the curved tool
path for 5-axis CNC machining is usually calculated and represented in the parametric
domain, using the parametric inverse kinematic equations derived in this study to
investigate the behavior of the machine's drives is advantageous and applicable. All the
constraints formulated in this study can be evaluated effectively at every point on the
tool path without any complex arc length re-parameterization of the cutter trajectory.
Finally, the optimal feed rate profile for the required tool path can be obtained, which plays
an important role in machining optimization for 5-axis CNC machines.

Experimental investigation and validation of the proposed method are left for
future work.
Abstract. The Internet of Things (IoT) is an attractive part of our daily
life. However, the development of IoT applications still faces many challenges,
such as handling unstructured data, intelligent analytics, connectivity,
compatibility and interoperability, integration, and data security
and privacy. In this paper, first, a real-time monitoring system is proposed
based on an opensource platform (called BKThings) and long-range
(LoRa) communication technology. Specifically, a LoRa gateway prototype
is implemented to collect, store and forward monitored data from
the IoT devices. In addition, LoRa end devices are developed that integrate data
calibration on the low-cost sensors to accurately measure environmental
and gas air parameters. Second, an opensource based
IoT platform functional architecture is developed. The implementation
includes the main functionalities of a typical IoT platform such as user management,
device registration, monitoring, connectivity, data acquisition,
processing, storage, and visualization on the dashboards. The achieved
results demonstrate that the tested implementation of the opensource
IoT platform and the LoRa IoT gateway is suitable and scalable for
sensing and visualizing environmental parameters.


Keywords: Opensource - IoT platform - LoRa - Edge devices 
calibration - Real-time monitoring 


1 Introduction 


Internet of Things (IoT) technology has added significant intelligence
to things and communication networks, bringing ordinary computers into
daily life. With the rapid development of embedded hardware platforms and software
packages, multiple sensors transmit monitored data through gateways and
connectivity networks to a cloud platform for processing, storage and analysis.
Based on the analytical results, human needs can be fulfilled by changing
and managing the environment [1,2].


This research is funded by the Ministry of Science and Technology of Vietnam under
the framework of “research and development of information technology products for
e-Government” (Ref: KC.01/16-20) with the research project titled “Research and
development of Internet of Things platform (IoT), application to management of high
technology, industrial zones”, the mission code is KC.01.17/16-20.

© Springer Nature Switzerland AG 2020
H. A. Le Thi et al. (Eds.): ICCSAMA 2019, AISC 1121, pp. 412-423, 2020.
https://doi.org/10.1007/978-3-030-38364-0_37


Opensource Based IoT Platform and LoRa Edge Device Calibration

A proposal of general requirements was released by ITU-T in 2014 [3]. Shen
et al. [4] outlined the common requirements of IoT technologies and the
individual functions in each requirement. These requirements cover non-functional
aspects, applications, services, connectivity, device management, and security
and privacy. In [5], a concept for a future IoT architecture is presented, including
its definition, a review of developments, the main requirements and a technical design
that enables implementation. Palattella et al. [6] described the protocol
stacks for the power-efficient PHY layer. In this regard, challenges in the design
of a home machine-to-machine (M2M) network using the IEEE 802.15.4 protocol for
ZigBee communication are discussed in [7].

An opensource cloud platform is a crucial component of IoT. It brings valuable
services to various application areas. A number of opensource platform
vendors for IoT offer specific and appropriate IoT-based services.
Several works [8]-[13] have investigated the development
of IoT solutions based on existing opensource platforms. Various requirements
for cloud and IoT integration were described in [8], where
an agent-oriented and cloud-assisted model is considered based on a reference
architecture. In [9], a general architecture is presented after analyzing
various studies, and a smart device supported by an IoT cloud is evaluated
for data collection, processing and monitoring. A brief survey of sensing
services over cloud-centric IoT is given in [10], where recent challenges are also analyzed. In [11],
a cloud IoT platform is proposed that integrates cloud and IoT. In [12], an
M2M remote telemetry station is implemented in cooperation with a big data processing
platform and various sensors; it demonstrates the use of IoT cloud
and data processing in a disease prediction and alerting application. Wang et al.
[13] described different notions (i.e., data center, cloud computing, data management
across data centers, benchmark, application kernels, standards, etc.) to
visualize how distributed IoT data could be processed in the clouds. Despite the
many possible uses of these open IoT clouds, to the best of our
knowledge, no comparative and analytical study on standardization has been
reported in the literature.

The contributions of this paper are as follows. First, we develop a LoRa
gateway prototype based on an open hardware platform in order to collect, store
and forward data monitored from the end devices. We also develop the LoRa end
devices, where data calibration is applied on the low-cost sensors to accurately
measure environmental and gas air parameters. Second, the open source based
IoT platform functional architecture is developed. The prototype implementation
consists of the main functionalities of a typical IoT platform, including user
management, device registration and management, data storage, processing and
visualization.
Fig. 1. BKThings cloud and LoRa communications-based monitoring system.

The remainder of the paper is organized as follows. Section 2 provides the overall
system description. The open source based IoT platform architecture, designed in
accordance with the characteristics of IoT applications, is presented and discussed
in Sect. 3. The data processing and visualization results of the case study, which is
evaluated in an environmental monitoring application scenario, are presented in
Sect. 4. The paper is concluded in Sect. 5.


2 System Description 


2.1 System Model 


Figure 1 presents the monitoring system based on the open platform BKThings cloud
and LoRaWAN. A gateway is implemented on a Raspberry Pi B+,
which is responsible for bridging between server and cloud. In this paper, the
gateway is responsible for connectivity between the end devices
and the open platform (BKThings) Server. The messages that the gateway receives from
the end devices follow an existing structure; they are standardized, converted,
corrected or coded and then sent to the BKThings Client. The Client is a pre-designed
firmware that runs its algorithms at the gateway to match
the platform before uploading data to the Server.

BKThings Client is software that can identify end devices and exchange
their data with the BKThings Server. The Client serves as the interface through which
end devices communicate with the platform. Figure 1 also shows the relationship
between end devices, Client and Server. The Client can be treated as a proxy in front of
the end devices that acts as their agent towards the Server. A Client can be any software
application that supports IoT protocols (such as MQTT, CoAP, etc.) to exchange
data with the Server. Note that we do not use the concept of device or things
here: just as one physical device can correspond to
multiple end devices, multiple connections can be opened simultaneously from
one device, corresponding to multiple Clients. In this paper, the MQTT protocol
is used. It is a lightweight publish/subscribe messaging protocol for M2M
IoT connectivity, and it connects the end devices, the
MQTT broker and the gateway. Its advantages are small code size, minimized
data packet size, open source implementation, and the ability to distribute packets to
multiple applications [14].

[Fig. 2: diagram of the LoRa gateway. The LoRa module (LoRa receiver on the Raspberry Pi) passes each message from an end device to the gateway service, which performs the consideration step, compares the message with the database, creates a new device or recognizes an old device, and then converts and sends the data to the BKThings Client.]

Fig. 3. The functionalized diagram and a prototype of the LoRa end device.

As shown in the diagram of the LoRa gateway (Fig. 2), the gateway receives
packets from the end devices via its LoRa module. These packets are sent
to the gateway service through the following two steps. (1) Consideration:
the gateway checks whether the packets are properly formatted. Packets
that do not fit the exact structure are rejected. Then, depending on
the user application, the server may request the end device to re-send the packets,
or skip them and wait for the next transmission. (2) Comparison:
the gateway checks the device ID against the database. The parameters defined in the
database include the device ID and the device type. If the device ID already exists,
the data are packed and sent to the server. If the device is not yet known,
it must be defined at the server to create a new end device. For
end devices to transmit data to the platform Server, the end device works as the
Client. In this paper, an Arduino board integrated with a LoRa module is
also implemented as the end device. It is responsible for collecting and
calibrating data from the low-cost sensors. The DHT11 sensor is used to measure
temperature and humidity, whereas the MQ-135 sensor is employed to estimate
the environmental air quality. The functionalized diagram and a prototype of the
LoRa end device are shown in Fig. 3.
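The two gateway steps can be pictured with a small sketch. It is not the BKThings gateway code: the JSON packet layout, the field names in `REQUIRED_FIELDS`, the in-memory `registry` dictionary standing in for the database, and all function names are assumptions made for illustration only.

```python
import json

# Hypothetical packet layout; the real message structure is not given in the text.
REQUIRED_FIELDS = ("device_id", "device_type", "data")

def consider(raw):
    """Step 1 (consideration): return the parsed packet if well formed, else None."""
    try:
        packet = json.loads(raw)
    except (ValueError, TypeError):
        return None
    if not isinstance(packet, dict):
        return None
    if any(field not in packet for field in REQUIRED_FIELDS):
        return None
    return packet

def compare(packet, registry):
    """Step 2 (comparison): look the device ID up and register unknown devices."""
    device_id = packet["device_id"]
    is_new = device_id not in registry
    if is_new:
        registry[device_id] = packet["device_type"]
    return is_new

def handle(raw, registry):
    """Run both steps and return what would be handed to the Client, or None."""
    packet = consider(raw)
    if packet is None:
        return None   # rejected; the application may ask the end device to re-send
    is_new = compare(packet, registry)
    return {"new_device": is_new, "payload": json.dumps(packet["data"])}
```

A malformed packet is dropped in the first step and never reaches the database lookup, which mirrors the order of the two steps described above.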


2.2 LoRa and LoRaWAN


The LoRa physical layer was described in 2014 [15]. In practice, its modulation produces
chirp signals of equal time duration [16]. In LoRa communications,
two types of chirps are defined. The first is the base chirp, whose
frequency sweeps from $f_{min} = -BW/2$ to $f_{max} = +BW/2$, where $BW$
is the spreading bandwidth of the signal. The second is the modulated chirp, which is a
cyclically time-shifted base chirp [15]. A chirp that starts at frequency $f_1 = +BW/2$ and
ends at $f_0 = -BW/2$ is referred to as a down-chirp; in practice it is the
complex conjugate of the base chirp. For each chirp, the time shift is calculated
by multiplying the chirp itself with a down-chirp. Suppose that $N$ is the length of a
symbol's chirp; then there are $N$ possible cyclic shifts of the base chirp.
The value of the cyclic shift is coded using $\log_2 N$ bits, which defines the spreading
factor ($SF$) of the LoRa communication. LoRa also uses diagonal interleaving
and forward error correction (FEC) codes to improve the robustness against
noise and burst interference. The code rates (CRs) usually range from 4/5 to
4/8. The symbol rate ($R_s$) and the data rate ($R_b$) depend on the SF and the
bandwidth, and are respectively given as

$$R_s = \frac{BW}{2^{SF}} \quad \text{and} \quad R_b = SF \cdot \frac{BW}{2^{SF}} \cdot CR,$$

where $SF$, $BW$, and $CR$ are the spreading factor, the bandwidth in kHz, and
the code rate, respectively. It can be noticed that the symbol time increases with
the SF, i.e., the symbol rate decreases. In addition, increasing the
SF lengthens each symbol faster than it adds bits per symbol, so the data rate decreases.
The packet structure consists of a preamble, an optional header and the data payload.
The preamble is used to synchronize the receiver with the transmitter. A LoRa transceiver
must know the SF to detect the preamble [15].
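The dependence of the symbol rate and data rate on the SF can be checked numerically. The sketch below uses the standard LoRa relations $R_s = BW/2^{SF}$ and $R_b = SF \cdot R_s \cdot CR$, with the code rate passed as a fraction between 4/8 and 4/5; the function names are hypothetical, and the bandwidth is taken in Hz here so that the rates come out per second.

```python
def symbol_rate(sf, bw_hz):
    """Symbols per second: R_s = BW / 2**SF."""
    return bw_hz / (2 ** sf)

def symbol_time(sf, bw_hz):
    """Duration of one symbol in seconds: T_s = 2**SF / BW."""
    return (2 ** sf) / bw_hz

def data_rate(sf, bw_hz, code_rate):
    """Bits per second: R_b = SF * (BW / 2**SF) * CR, with CR such as 4/5."""
    return sf * symbol_rate(sf, bw_hz) * code_rate
```

With a 125 kHz bandwidth and code rate 4/5, `data_rate(7, 125000, 4 / 5)` is a few kilobits per second while `data_rate(12, 125000, 4 / 5)` drops to a few hundred bits per second, which illustrates the range versus throughput trade-off controlled by the SF.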

The LoRaWAN medium access control (MAC) protocol is an open source
protocol standardized by the LoRa Alliance [17]. It provides the medium access
control mechanism that enables connectivity between different end devices and
gateway(s). In its star topology, the end devices
connect only to LoRaWAN gateway(s), not directly to each other. Multiple
gateways connect to the central server. The gateways are only responsible for
forwarding data packets from end devices towards the server, encapsulating them
in UDP/IP packets. Further, the connectivity can be terminated at the application
servers by third parties. In this paper, the gateway is designed specifically
for LoRa. However, the gateway can be used with other types of connections
such as Bluetooth, ZigBee, Z-Wave, Cellular, etc., which typically requires
installing the suitable environment and library.

Fig. 4. BKThings IoT platform functional architecture.


3 The Opensource Platform 


3.1 Opensource Based IoT Platform - BKThings 


The proposed BKThings opensource platform is built on the M2M
service platform in compliance with the ETSI standard. It provides a RESTful
API for exchanging XML data through unreliable connections in a distributed
environment. It provides a modular architecture that operates on top of OSGi
and Equinox [18] (Fig. 4). This platform provides a flexible service capability layer (SCL)
that can be deployed in M2M software modules, gateways or end devices [19]. An SCL
consists of tightly linked small plugins, each providing specific functions. A
plugin can be remotely installed, started, stopped, updated and uninstalled without
requiring a reboot. The SCL can also detect additions or removals of services
through the service registry and make appropriate adjustments, which facilitates
its expansion.

The CORE plugin module provides a protocol-independent service to handle
REST requests. Mapping plugins with specific interfaces can be added to support
protocols such as HTTP and MQTT. The platform can be extended
through specific device management mapping plugins to update device management
(DM) programs using existing protocols such as Open Mobile Alliance
(OMA) - DM [20]. It can also be extended with different internetworking proxy
plugins to enable communications with end devices using Bluetooth, Wi-Fi, ZigBee,
Z-Wave, LoRa, and even 3G/4G/5G technologies.

Fig. 5. The software modules of BKThings platform.

The software modules of the BKThings platform are described in Fig. 5. The
CORE plugin implements the SCL_Service interface to handle common RESTful
requests. It receives a protocol-independent request indication and returns a
protocol-independent response confirmation. Resource_Controller implements the
resource methods Create, Get, Update and Delete (CRUD). It performs the necessary checks
such as access authorization and syntax verification of resources. When a resource
is created, updated or deleted, Event_Notifier sends notifications to all concerned
subscribers. It filters out the events that a subscriber does not want.
Resource_Announcer announces a resource to the remote SCL
explicitly so that other machines can easily access it. It also handles
the cancellation of resource announcements. Resource_DAO provides a discrete interface
that encapsulates all access to the persistent resource storage without revealing
the database details. The Router module identifies a unique route to handle
every request in the Resource_Controller using only the requested URI and the
method. Request_Sender holds the discovered protocol-specific clients that
implement the Client_Service interface. It acts as a proxy to send a general
request through the correct protocol. Interworkproxy holds the discovered
internetworking proxy units (IPUs) that implement the IPU_Service interface. It acts
as a proxy to call the correct IPU device controller. Device_Manager holds the
Remote Entity Managers (REMs) that implement the REM_Service interface and
acts as a proxy to call the correct device manager controller.
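The interplay of the Router, Resource_Controller and Event_Notifier roles can be illustrated with a toy in-memory model. This is only a sketch of the dispatch idea, not the platform's code (which runs as OSGi plugins): the class and function names, the numeric status codes and the callback-based subscription are all assumptions.

```python
class ResourceController:
    """In-memory stand-in for the CRUD methods; notifies subscribers on changes."""

    def __init__(self):
        self.resources = {}      # URI -> resource representation
        self.subscribers = []    # callables taking (event, uri)

    def _notify(self, event, uri):
        for callback in self.subscribers:
            callback(event, uri)

    def create(self, uri, body=None):
        if uri in self.resources:
            return 409, None
        self.resources[uri] = body
        self._notify("created", uri)
        return 201, body

    def get(self, uri, body=None):
        if uri not in self.resources:
            return 404, None
        return 200, self.resources[uri]

    def update(self, uri, body=None):
        if uri not in self.resources:
            return 404, None
        self.resources[uri] = body
        self._notify("updated", uri)
        return 200, body

    def delete(self, uri, body=None):
        if uri not in self.resources:
            return 404, None
        del self.resources[uri]
        self._notify("deleted", uri)
        return 200, None


def route(controller, method, uri, body=None):
    """Select the handler from the method name, then apply it to the URI."""
    handlers = {
        "CREATE": controller.create,
        "GET": controller.get,
        "UPDATE": controller.update,
        "DELETE": controller.delete,
    }
    handler = handlers.get(method.upper())
    if handler is None:
        return 405, None
    return handler(uri, body)
```

Only the method and the URI decide which handler runs, and read requests produce no notification, matching the description that notifications follow creation, update and deletion.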
Fig. 6. User management and device registration & management of BKThings.


The OMADM_Mapping plugin software module provides bidirectional mapping
to manage devices that support OMA-DM. The OMADM_Monitor listens
for OMA-DM enabled devices and calls the CORE plugin to create the
required resources on the SCL. The OMADM_Controller implements the
REM_Service interface to transform common requests into OMA-DM management
operations. The HTTP_Mapping plugin software module supplies bidirectional mapping
to the HTTP protocol. HTTP_Servlet receives an HTTP request, converts it
into a common request and invokes the CORE plugin through the SCL_Service
interface. The HTTP_Client implements the Client_Service interface to send a
common request using HTTP. The Phidgets_Mapping plugin software module
supplies bidirectional mapping for interacting with older Phidgets devices.
Phidgets_Monitor detects Phidgets devices and calls the CORE plugin to create
the required resources on the SCL. The Phidgets_Controller implements
the IPU_Service interface to seamlessly execute a common request through the
Phidgets API. The DB_Driver plugin software module supplies an Object Oriented
Data Base (OODB) accessible through the DB_Service interface. Following
the same approach, other plugin software modules can be deployed to interact
with additional protocols or integrate new capabilities.


3.2 Implementation


Within the scope of this paper, we implemented a prototype of this architecture
to complete our IoT testbed. The prototype implementation consists of the
major functionalities of a typical IoT platform, which include user management,
device registration and management, and data storage and visualization
in the dashboard (see Fig. 6). The developed platform will be improved in the
future with AI (Artificial Intelligence) and ML (Machine Learning) mechanisms
in order to provide the IoT data analytics features needed for smarter applications.
Algorithm 1. Calibration for the Temperature & Humidity sensor, DHT11
Input: $C_g$: calibrated data of the gas sensor, $t$: temperature, $H_r$: humidity value,
$A_v$: current analog value read from the gas sensor, $R_l$: the external load resistance,
$A_m$: the maximum analog read value, $R_s/R_0$: resistance ratio
1: Read data from sensor pins ($A_v$, $t$, and $H_r$)
2: Convert the measured values to dependency values ($H_r$, $\theta$, and $\sigma$)
3: Calculate the calibrated value of temperature and humidity, $D_{in}$, using Eq. (1)
4: Calculate the $R_s/R_0$
5: Calculate the calibrated value $C_g$


Algorithm 2. Calibration for the Gas sensor, MQ-135
1: Calculate the $R_0$ value
2: Calculate the $R_s$ value
3: Read analog pins from the Gas sensor
4: Collect multiple samples and calculate the average ($S$)
5: Calculate $R_0 = S$ / clean air factor
6: Extrapolate coefficients $a$ and $b$
7: Calculate the ppm value, ppm $= a(R_s/R_0)^b$


4 Data Processing and Visualization Results 


4.1 Data Processing 


In the implementation, the characteristics of low-cost sensors differ from one
manufactured unit to another. Therefore, each sensor used in the monitoring system needs
pre-calibration to accurately measure environmental and gas air parameters. In the
designed system, the DHT11 air temperature and humidity sensor and the MQ-135 gas
sensor are used. Temperature and humidity affect low-cost environmental
sensors. Therefore, calibrated values need to be adjusted with respect
to these parameters to validate the sensing accuracy. Algorithm 1 presents how
to apply the auto-calibration of the DHT11 sensor. Assume that $C_g$ is the calibrated
data value of the gas sensor. It is calculated as $C_g = (R_s/R_0)D_{in}$,
where $R_s$ and $R_0$ are the sensor resistances in the presence of a certain gas and
in clean air, respectively; $D_{in}$ is the calibrated value that depends on the
temperature and humidity. The sensor resistance ratio can be estimated by
$R_s/R_0 = R_l(A_m/A_v) - R_l$, where $R_l$ is the external resistance. Finally, the
calibrated value can be calculated by


$$D_{in} = \theta t^2 - t + \sigma, \tag{1}$$


where $t$ is the current temperature, and $\theta$ and $\sigma$ are the temperature and humidity
dependency values, respectively. Algorithm 2 introduces the steps for calibration
of the gas sensor MQ-135. The coefficients $a$ and $b$ are extrapolated from the
curves provided in the sensor's data sheet.
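The arithmetic behind the two calibration algorithms can be sketched as follows. The sketch makes several assumptions that go beyond the text: the expression $R_l(A_m/A_v) - R_l$ is treated as the sensor resistance $R_s$, which is then divided by $R_0$; the function names are hypothetical; and the values of the coefficients `a`, `b`, `theta`, `sigma` and of the clean air factor must come from the sensor data sheet and from calibration, none of which is reproduced here.

```python
def sensor_resistance(a_v, a_m, r_l):
    """Voltage-divider expression R_l * (A_m / A_v) - R_l used in the text."""
    return r_l * (a_m / a_v) - r_l

def clean_air_r0(samples, a_m, r_l, clean_air_factor):
    """Algorithm 2, steps 3-5: average clean-air readings, divide by the factor."""
    mean = sum(sensor_resistance(a, a_m, r_l) for a in samples) / len(samples)
    return mean / clean_air_factor

def gas_ppm(r_s, r_0, a, b):
    """Algorithm 2, step 7: power-law fit ppm = a * (R_s / R_0) ** b."""
    return a * (r_s / r_0) ** b

def dependency_value(t, theta, sigma):
    """D_in = theta * t**2 - t + sigma, following Eq. (1) as printed."""
    return theta * t ** 2 - t + sigma

def calibrated_gas_value(r_s, r_0, d_in):
    """C_g = (R_s / R_0) * D_in."""
    return (r_s / r_0) * d_in
```

In use, `clean_air_r0` would be run once in clean air to obtain $R_0$, after which every new reading is converted with `sensor_resistance` and passed to `gas_ppm` or `calibrated_gas_value`.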
Fig. 7. Measured data collected at the LoRa gateway from end devices.

Fig. 8. RSSI (above) and SNR (below) at the LoRa IoT gateway.


4.2 Data Visualization 


To visualize the effectiveness of the monitoring system in case studies, the
edge-calibrated end devices, the LoRa gateway and the BKThings platform were integrated.
The air quality index (AQI) is calculated by measuring
six main air pollutants directly with the MQ-135 sensor, whereas the temperature
and humidity parameters are measured with the DHT11 sensor. As
presented in Sect. 2.2, changing the SF changes
the receiver's sensitivity: a higher SF increases the ability to demodulate
received signals that are weak due to long-range transmission. This means that data can be
transmitted further by increasing the SF. Figure 7 shows the measured packets
received at the LoRa gateway for two IoT devices. In the experimental tests
for different data visualizations, various measurement scenarios were formulated
and are shown in Figs. 8, 9 and 10. In Fig. 8, the RSSI (Received Signal Strength
Indicator) and SNR (Signal-to-Noise Ratio) parameters were measured at the
gateway while changing the transmission distance between the IoT devices and
the LoRa gateway for both indoor and outdoor cases. It is observed that the
quality of the indoor received signal is almost similar to that under outdoor conditions
at the same distance. Then, the air temperature and humidity were measured and
visualized in Fig. 9. Finally, the air quality index
was measured and visualized in Fig. 10.

Fig. 9. Air temperature and humidity visualization in the open platform.

Fig. 10. Air quality index visualization in the open platform.


5 Conclusions 


In this paper, we developed an IoT architecture based on an opensource platform
and LoRa technology with edge calibration for IoT devices, and applied it to a
real-time air monitoring system. The designed system integrates open, license-free
access via LoRa technology. The system prototype is small,
low-cost, and easy to implement and extend on the open platform.
It is able to monitor multiple gases, especially six main
air pollutants, as well as temperature and humidity. Calibration algorithms at the IoT
devices were employed to avoid temporary errors and unnecessary data before
transmission to the gateway and cloud. Future work includes
adding machine learning and big IoT data analytics to the platform in order to
provide smarter services.